
2168-7161 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2014.2316810, IEEE Transactions on Cloud Computing


Extending MapReduce across Clouds with BStream

Sriram Kailasam, Prateek Dhawalia, S J Balaji, Geeta Iyer, Janakiram Dharanipragada

Distributed and Object Systems Lab, Dept. of Comp. Sci. and Engg., IIT Madras

Chennai, India 600036
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—Today, batch processing frameworks like Hadoop MapReduce are difficult to scale to multiple clouds due to the latencies involved in inter-cloud data transfer and the synchronization overheads during the shuffle phase. This inhibits the MapReduce framework from guaranteeing performance under variable load surges without over-provisioning in the internal cloud (IC). We propose BStream, a cloud bursting framework that couples stream processing in the external cloud (EC) with Hadoop in the internal cloud (IC) to realize inter-cloud MapReduce. Stream processing in EC enables pipelined uploading, processing and downloading of data to minimize network latencies. We use this framework to guarantee the service-level objective (SLO) of meeting job deadlines. BStream uses an analytical model to minimize the usage of EC and burst only when necessary. We propose different checkpointing strategies to overlap output transfer with input transfer/processing while simultaneously reducing the computation involved in merging the results from EC and IC. Checkpointing further reduces the job completion time. We experimentally compare BStream with other related works and illustrate the benefits of using stream processing and checkpointing strategies in EC. Lastly, we characterize the operational regime of BStream.

Index Terms—mapreduce, inter-cloud, stream processing

I. INTRODUCTION

In recent years, Hadoop MapReduce1 has been used extensively for performing near real-time "Big Data" analytics such as personalized advertising, hourly traffic log analysis, fraud detection, sentiment analysis, and so forth [1], [2], [3]. These applications have strict service-level objectives (SLOs), such as deadlines, and experience highly variable input load. Missing the SLOs due to high load can result in significant penalties, such as delays in updating customer-facing content, associated revenue loss, and so forth. Hence, to handle load variations, a private data center is typically over-provisioned, which causes wastage of resources.

Cloud bursting provides an alternative to over-provisioning by offloading the excess load from the internal cloud (IC) to an external cloud (EC) (e.g. Amazon EC2, Rackspace) [4], [5], [6]. However, there are two main difficulties in scaling Hadoop MapReduce using cloud bursting. The first difficulty comes from Hadoop being a batch-processing system that requires the entire input data for the job to be materialized before the start of computation. As the inter-cloud data transfer latencies are at least an order of magnitude higher than those within a data center, batch processing using Hadoop in EC incurs huge startup latencies: waiting for the entire input data to be uploaded and then processing it.

1 http://hadoop.apache.org/

The second difficulty arises due to the tight coupling of the shuffle phase in MapReduce with the reducers. The shuffle phase begins only after the reducers have started (occupied slots). As the shuffle involves all-to-all node communication, shuffling data from EC to the reducers in IC takes longer due to the difference between inter-cloud and intra-cloud bandwidth. This elongated shuffle phase not only delays job completion but also causes idle CPU cycles within the reducers.

Works like [7]–[9] overcome the first difficulty by enabling mappers in EC to perform streaming reads from the remote HDFS in IC, thereby minimizing startup latencies. These works are based on the previous version of Hadoop (v1.0), where there is a fixed partitioning of map slots and reduce slots. Here, the reducers are started right at the beginning of the map phase, by default. This minimizes the delay in job completion time to some extent (by overlapping the elongated shuffle from EC with processing), but at the cost of idle CPU cycles.

In the newer version of Hadoop, called YARN2, the fixed partitioning of slots is no longer maintained. Thus, by delaying the start of the reducers, it is possible to use these slots for performing map tasks. This improves the utilization of slots, thereby reducing job completion time. In an inter-cloud setting with an elongated shuffle, it is not possible to delay the start of the reducers unless the shuffle is decoupled from the reducers in IC. The shuffle from EC can be further optimized by performing the reduce operation in EC, which decreases the data download size.

We propose BStream, a new cloud bursting framework that addresses the above difficulties and incorporates various optimizations. BStream uses a stream processing engine called Storm3 in EC and YARN in IC. Using Storm, both map and reduce operations are executed on the incoming stream of data in EC as and when it arrives. By overlapping processing with input data transfer, BStream minimizes the startup latencies in EC. By executing the reduce operation in EC, it decreases the download size. The shuffle from EC is also decoupled from the YARN reducers, thereby allowing the reducers to start much later in the job lifecycle. The shuffle is further optimized using checkpointing strategies that download intermittent reduce output from EC to the reducers in IC. Thus, using stream processing and checkpointing strategies, BStream enables pipelined uploading, processing and downloading of data.

2 http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/
3 https://github.com/nathanmarz/storm


BStream uses an analytical model to estimate which portions of the MapReduce job to burst, when to burst, and when to start the reducers, so as to meet the job deadline. We currently consider meeting deadlines for individual jobs whose input data is initially present only in IC. Such a scenario is encountered by enterprises that use their private data center (IC) for normal operation and employ cloud bursting only when there are load surges. Extending the framework to multiple jobs is part of future work. Towards this end, we characterize the operational regime of the cloud bursting framework in terms of the speedup it can provide to different jobs depending on their input size, computation requirement, output to input ratio, and available slack time.

The rest of the paper is organized as follows. Section II introduces MapReduce and Storm, and discusses different approaches for realizing inter-cloud MapReduce. Requirements and challenges for realizing inter-cloud MapReduce are presented in Section III. Section IV describes related work. Section V presents the analytical model for BStream and Section VI describes the BStream system architecture in detail. The performance evaluation of BStream is presented in Section VII and the operational regime is characterized in Section VIII. Section IX discusses limitations of the model and proposes extensions. Section X presents the conclusion.

II. BACKGROUND

We first discuss the MapReduce programming model and its open-source implementation, Hadoop. We also present the changes proposed in Next Generation Hadoop, known as YARN. This is followed by a discussion of different approaches to implementing inter-cloud MapReduce and their limitations. Some of these limitations can be overcome by stream processing in EC. BStream uses the stream processing engine Storm. We conclude this section by describing Storm.

A. MapReduce and Hadoop

MapReduce [10] computation primarily consists of a map phase and a shuffle and sort phase, followed by a reduce phase. A map function takes a key-value pair as input and produces a list of intermediate key-value pairs that are partitioned and stored locally on each node. In the shuffle phase, the reducers fetch the intermediate data corresponding to their partition from all the mappers. In the sort phase, the reducer sorts the shuffled data according to the key, to group the values for the same key. Finally, the reduce operation is applied to the list of values associated with each distinct key to produce one or more key-value pairs. The MapReduce runtime is responsible for parallelizing the execution of these functions and handling issues related to load balancing and fault tolerance.
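As a concrete illustration of this model, the following minimal word-count sketch in Python mimics the three phases; the small driver below merely stands in for the MapReduce runtime (it is not the Hadoop API, and the function names are ours):

from collections import defaultdict

def map_fn(offset, line):
    # Map: emit an intermediate (word, 1) pair for every token in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: the grouped values for one key are aggregated into final pairs.
    yield word, sum(counts)

def run_job(lines):
    # Stand-in for the runtime: map, then shuffle/sort by key, then reduce.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)
    return dict(pair for key in sorted(groups)
                for pair in reduce_fn(key, groups[key]))

print(run_job(["to be or not to be"]))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}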

Hadoop, an open source implementation of MapReduce, consists of the Hadoop Distributed File System (HDFS) and the MapReduce runtime. The input files for MapReduce jobs are split into fixed-size blocks (default 64 MB) and stored in HDFS. The MapReduce runtime follows a master-worker architecture. The master (Job-Tracker) is responsible for assigning tasks to the worker nodes. Each worker node runs a Task-Tracker which manages the currently assigned task. Each worker node is configured with a fixed number of map and reduce slots. This hard partitioning of resources leads to under-utilization and wastage of compute cycles.

[Fig. 1: four ways of splitting a MapReduce job between the internal and external clouds: (a) maps on both clouds, reduces only internal; (b) complete MapReduce sub-jobs on each cloud followed by a global reduce (G); (c) maps and reduces on both clouds; (d) MapReduce in the external cloud feeding its output directly to the reducers in the internal cloud.]

Fig. 1. Approaches for Implementing Inter-cloud MapReduce

The new version of Hadoop, YARN (Yet Another Resource Negotiator), proposes a more generic resource model based on container allocation, thereby removing the fixed partitioning of map/reduce slots per node. It also addresses the limitation of a single Job-Tracker by splitting its functionality into a single Resource Manager (RM) per cluster and one Application Master (AM) per job. The RM is responsible for cluster resource management, while the AM is responsible for the application life-cycle management of each job. We modified YARN to burst a portion of the job to EC and merge the results from EC in the reduce phase (refer Section VI-C).

B. Approaches for implementing inter-cloud MapReduce

Fig. 1 illustrates different ways in which a MapReduce job can be split across multiple clouds. In Fig. 1a, the map tasks are distributed across IC and EC, while the reduce tasks are executed only in IC. This results in huge data transfer overheads during the shuffle operation. This could be overcome by splitting the MapReduce job into different sub-jobs and executing them separately on IC and EC (refer Fig. 1b). A final MapReduce job (G) is responsible for merging the results of these jobs at the end. However, this approach introduces the overhead of launching an additional job for performing the global reduction. Fig. 1c avoids this by distributing both map and reduce tasks across multiple clouds. However, the difference between inter-cloud and intra-cloud latencies can lengthen the shuffle phase, thereby wasting many compute cycles on the nodes. Also, the final result from the reducers in EC has to be downloaded to IC. In BStream, we adopt the approach illustrated in Fig. 1d, wherein the output of the MapReduce job in EC is directly transferred to the reducers in IC. We propose a model to determine the start time of the reducers in IC, considering the overheads associated with inter-cloud and intra-cloud data transfer. This minimizes the wastage of compute cycles during the shuffle phase without incurring the overhead of a global reduction.

C. Storm

Storm is a distributed, fault-tolerant, real-time computation engine built upon the message processing paradigm. Computation in Storm is expressed using a topology consisting of spouts and bolts. A spout corresponds to a source of a data stream.


A bolt processes incoming data streams from connected spouts and emits output streams that can be consumed by other bolts. The Storm runtime also follows a master-worker architecture. The master (Nimbus) is responsible for task distribution and fault-tolerance, whereas the workers (Supervisors) are responsible for task execution. Co-ordination and state management are done using Zookeeper4. Storm provides support for exactly-once processing semantics using transactional topologies5. BStream uses Storm with a transactional topology in EC. Unlike a Hadoop MapReduce job, a Storm topology runs forever. Therefore, we added support for topology termination in Storm (refer Section VI-D).
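To make the correspondence between MapReduce and a Storm topology concrete, the following plain-Python sketch mimics the spout -> map bolt -> reduce bolt dataflow that BStream runs in EC. It is a conceptual illustration only, not the Storm API: generators stand in for streams, and the reduce bolt keeps running per-key state, which is what later allows partial results to be checkpointed before the stream ends.

def spout(lines):
    # Source of the stream: emits input tuples as they arrive (e.g. from Kafka).
    for line in lines:
        yield line

def map_bolt(stream):
    # Applies the map function to every incoming tuple and emits (key, value) pairs.
    for line in stream:
        for word in line.split():
            yield word, 1

def reduce_bolt(stream):
    # Maintains running per-key state; since the reduce is associative,
    # a snapshot of this state is already a valid partial result.
    state = {}
    for key, value in stream:
        state[key] = state.get(key, 0) + value
    return state

print(reduce_bolt(map_bolt(spout(["a b a", "b c"]))))  # {'a': 2, 'b': 2, 'c': 1}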

III. REQUIREMENTS & CHALLENGES

In this section, we discuss design considerations and corresponding challenges for meeting deadlines for a single job.

A. Allocate just enough resources to a job to meet deadline

This implies that a job uses minimal cluster resources to meet its deadline, thereby keeping as many resources as possible free for the execution of other jobs. To realize this policy, we need a model that can determine, for a given input, the resource provisioning across IC and EC such that the job completion time is pushed as close to the deadline as possible. YARN no longer maintains a hard partitioning of map and reduce slots. Thus, cluster utilization can be improved by reducing the idle time within slots. A candidate for this improvement is the reducer slot. In MapReduce, the reducers are started early in the job execution to overlap the shuffling of data with the map phase. However, this introduces idle time within reducer slots. The analytical model discussed in Section V determines the time to start the reducers based on the shuffle data size.

B. Batch vs Stream processing in EC

The completion time of a job in EC can be broken into three components: upload time, processing time, and download time. The upload time is the same for both batch and stream processing. Stream processing, however, allows the input data to be processed continuously as it arrives, thereby overlapping processing with upload. If the processing rate is equal to the upload rate, then the processing delay becomes zero. The download time for stream processing is also lower if the download is overlapped with the upload/processing of the job.
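A back-of-the-envelope comparison captures this decomposition; the sketch below (Python) assumes that processing keeps pace with the upload and that a fixed fraction of the download overlaps with upload/processing, with purely illustrative numbers:

def batch_completion(upload_s, process_s, download_s):
    # Batch processing in EC: upload, processing and download run in sequence.
    return upload_s + process_s + download_s

def stream_completion(upload_s, process_s, download_s, overlap=0.8):
    # Stream processing in EC: processing overlaps with the upload (any excess
    # shows up as a processing delay), and a fraction `overlap` of the download
    # is hidden behind upload/processing.
    processing_delay = max(0.0, process_s - upload_s)
    return upload_s + processing_delay + (1.0 - overlap) * download_s

# Illustrative values: 204 s upload, 150 s of compute, 80 s download.
print(batch_completion(204, 150, 80))   # 434
print(stream_completion(204, 150, 80))  # 220.0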

To quantify the benefits of using stream processing in EC, we compared batch processing (Hadoop) and stream processing (Storm) under different operating conditions by varying the input data size, computation time per map, inter-cloud bandwidth, and output to input ratio in the WordCount application. With fixed computation time per map (mt = 50s) and inter-cloud bandwidth (5MBps), and varying input size, the total completion time of Storm jobs is lower by 23% on average than that of the corresponding Hadoop jobs (refer Fig. 2a). In Fig. 2b, with fixed input size (1GB) and inter-cloud bandwidth (5MBps), and varying computation time per map (mt), the processing delay for Hadoop increases, while the processing delay for Storm continues to be zero. Thus, the difference in completion time between Hadoop and Storm increases as the map computation time increases. In Fig. 2c, we observe that as the inter-cloud bandwidth increases, the overhead associated with data transfer decreases. Hence, the total completion time of Hadoop comes closer to that of Storm.

Thus, we conclude that for jobs requiring significant data transfer, stream processing performs much better (in terms of job completion time) than batch processing using the same number of resources in EC. As long as the processing rate matches the upload rate, the performance gain of stream processing over batch processing increases with the computation requirement of the job.

4 http://zookeeper.apache.org/
5 https://github.com/nathanmarz/storm/wiki/Transactional-topologies

C. Minimize effect of network latency due to cloud bursting

Stream processing in EC enables pipelined uploading, processing and downloading of data. However, downloading partial outputs from the reducers in EC would incur additional network cost due to retransmission in case the value of any downloaded key gets updated. It would also transfer a large portion of the reduce computation back to IC. On the other hand, waiting for the entire processing to complete before downloading the output would increase the wait time. Thus, there is a tradeoff between additional reduce computation in IC versus the latency for downloading the output from EC. We propose different checkpointing mechanisms in Section VI-E to efficiently transfer partial outputs from EC.

Failure of the nodes (reducers in IC) that receive updates from EC will lead to additional network costs and delays in job completion. BStream uses the replication mechanisms discussed in Section VI to handle the failure of receiving nodes.

D. Use EC only when necessary

If the deadline can be met by just using resources in IC, then the system should not use resources in EC. Further, the system should delay bursting the job to as late as possible in the job lifecycle to incorporate the possibility of new slots becoming free in IC. We discuss how BStream incorporates this requirement in Section V.

IV. RELATED WORK

Hadoop MapReduce does not offer any guarantees on the completion time of jobs [11], [12]. Recently, many works have proposed different models (analytical models, simulation models, and so forth) to guarantee deadlines for MapReduce jobs. We discuss these models in Section IV-A. Relevant works on multi-cloud MapReduce are discussed in Section IV-B.

A. Guaranteeing Deadlines for MapReduce within a Single Cluster

Kamal et al. [11] propose a cost model to determine whether a job deadline can be met. Any job whose deadline cannot be met is rejected. The cost model assumes that the map phase, shuffle phase and reduce phase execute sequentially, thereby ignoring the overlap between map and shuffle.


[Fig. 2: stacked bars of upload time, processing overhead and download time for Hadoop (H) and Storm (S) configurations under (a) varying input data size with mt = 50s, (b) varying computation time with inter-cloud bw = 5MBps, and (c) varying inter-cloud bandwidth with mt = 200s.]

Fig. 2. Hadoop vs Storm in External Cloud: the performance of stream processing (Storm) with respect to batch processing (Hadoop) improves when the ratio of computation time to data transfer time increases

Verma et al. [13] propose a framework called ARIA (Automatic Resource Inference and Allocation) that uses job profiling and an analytical model to schedule deadline-based MapReduce jobs. ARIA considers the overlap between the map and shuffle phases, and determines the number of map and reduce slots required to meet the job deadline. Some of the profiling parameters used in BStream, such as msel, rsel, mt, rt, are similar to those in ARIA (refer Section V). However, ARIA does not estimate the reduce slowstart parameter, which can enable better utilization of slots, as ARIA assumes that the reducers are started right from the beginning of execution. ARIA has been further extended to support dynamic allocation and deallocation of spare cluster resources during runtime. Starfish [14] is a self-tuning system for Hadoop that builds a detailed profile of job-level and cluster-level parameters, and automatically finds optimal Hadoop configuration settings for running a MapReduce job. This is complementary to our approach and can be used to improve Hadoop's performance in IC. Ferguson et al. [15] propose a framework called Jockey that uses monitoring and adaptation to provide support against changes in cluster profile and job deadlines. Their approach considers a more general program structure compared to MapReduce. None of the above approaches utilize resources from EC for meeting deadlines.

B. Multi-Cloud MapReduce

Multi-cloud MapReduce is implemented using custom frameworks in [5], [16]–[20] and using Hadoop in [7]–[9]. We classify the survey of related work into non-Hadoop based frameworks and Hadoop-based frameworks.

Non-Hadoop based frameworks: As discussed previously (refer Section II-B), there are different approaches for partitioning a MapReduce job across clusters. In [16], the MapReduce job is partitioned according to the policy shown in Fig. 1b, where MapReduce jobs are run locally in each cluster and their results are aggregated using a "Global-Reduce" phase. The input data is partitioned across the different MapReduce clusters according to their compute capacity. However, the input/output sizes are assumed to be negligible. Works like [17], [20] have proposed deadline-aware scheduling for MapReduce on top of different cloud platforms like Aneka [5] and CometCloud [21], respectively. In both of them, the job is partitioned according to Fig. 1c, which incurs high network latencies when the shuffle data size is huge, thereby increasing the duration of the reduce phase.

The authors of [17] assume that data is replicated on either cloud and consider deadlines only for the map phase of the MapReduce job. In [18], the authors use a generalized reduction API that integrates the map, combine and reduce operations into a single operation (local reduction) at the node level to reduce the overheads of shuffling, grouping, and sorting large intermediate data. A global reduction is finally performed to aggregate all the node-level results. The goal is to complete the job as early as possible. This work was extended to incorporate deadline/cost constraints in [19]. However, their task stealing approach does not consider the output to input ratio and can result in the accumulation of a large amount of partial output on a node with a slow outlink, causing considerable delays in the global reduction stage. We illustrate the downside of such a task stealing approach in Section VII-C.

Some works extend MapReduce to support stream processing by operating on the input as it arrives [22], [23]. However, these works are restricted to a single cloud.

Hadoop based frameworks: Cardosa et al. [24] explored the efficiency of running MapReduce across two clouds in different configurations such as Local MapReduce, Global MapReduce and Distributed MapReduce. They showed that, depending on the nature of the job, one of the configurations performs better. However, their approach uses batch processing. We have shown in Section III-B that batch processing in EC causes significant startup latency. A few works support streaming reads within maps from remote HDFS [7]–[9]. Thus, they overlap input data transfer from IC with processing within maps in EC. While this approach decreases the startup latency, it increases the overhead during task failures in EC as the incoming data is not stored in EC. In [8], each incoming data block is replicated before the map is performed on the local block. The above works partition the job according to the policy shown in Fig. 1a, where only maps are executed in EC. In [7], the size of the input split to each node is varied according to the node's compute capability and link speed. The nodes are kept as busy as possible by ensuring that processing and the subsequent shuffle of map output are synchronized across all of them. The authors of [8], [9] propose similar techniques, referred to as map-aware push and shuffle-aware map, considering the upload rate, processing rate and download rate of every node to decide data distribution and task placement. While the paper [7] assumes a single source, [8], [9] work with distributed sources.


None of them performs the reduce operation in EC. Thus, they have to shuffle the entire map output from EC to the reducer nodes in IC. This approach suffers from network overheads when the jobs are shuffle-heavy. In order to minimize network overheads, the authors in [25] proposed algorithms for the placement of reduce tasks to exploit the locality of map output. But their placement algorithm ignores the final destination where the output of the reduce is required. Thus, it may end up running a reduce task on a node with a slow outlink. In YARN, there is no hard partitioning of slots into map and reduce slots. Therefore, if the reducers are started as late as possible in the job execution, then the reduce slots can be used to run more parallel maps. However, to overlap the shuffle with the maps, especially when downloading data from remote nodes having slow outlinks, the reducers have to be started early on because the shuffle is coupled with the Hadoop reducer. We illustrate the downside of this coupling approach in Hadoop-based frameworks in Section VII-C.

Volunteer-based frameworks: Works like MOON [26], VMR [27], P2P-MapReduce [28] and so forth exploit volunteer resources to perform distributed MapReduce. Volunteer environments face the challenges of node heterogeneity (network, storage and compute capacity of participating nodes), high resource churn, and security. MOON [26] uses a small set of dedicated nodes in addition to volunteer nodes to ensure that data is highly available and tasks can complete in the presence of high churn. VMR [27] replicates tasks on multiple volunteer nodes and validates the results from them using majority voting. However, both MOON [26] and VMR [27] assume that centralized components like the Namenode and Job tracker are run on dedicated resources and do not handle their failure. In contrast, P2P-MapReduce [28] uses a peer-to-peer model with multiple nodes acting as masters (e.g. multiple job trackers) to account for master failures in the presence of churn. In general, guaranteeing strict SLOs with volunteer resources at all times is difficult due to the volatile nature of participating nodes. While volunteer resources can accelerate performance and improve the system's goodput, an external cloud with dedicated resources may be necessary to stabilize the MapReduce computation (meeting strict SLOs). Exploring hybrid clouds involving volunteer resources is part of our future work.

V. ANALYTICAL MODEL

The analytical model provides a resource-allocation plan to meet the job deadline. It estimates parameters such as the number of resources to be provisioned in IC/EC, the number of maps to burst, when to burst, and when to start the reducers in IC. We classify the model parameters into input parameters, profile (job/cluster) parameters, and output parameters.

Input parameters:
J: job-profile type
M: number of map tasks within a job
R: number of reduce tasks within a job
T: deadline/completion time of job
bEC: inter-cloud bandwidth

Profile parameters:
B: block size in HDFS (typically 64 MB)
mt: processing time per map
rt: processing time per reduce
msel (rsel): selectivity (ratio of output size to input size) of the map (reduce) phase
k: size of data output per map to a reduce
α: rate at which data is transferred and merged during the shuffle and sort phase
β: time to process per MB in the Hadoop reducer
γ: degree of overlap of download with processing in EC

Output parameters:
Ts: duration of shuffle phase
Tr: duration of reduce phase
Tbs: time at which shuffle starts
Tbr: slack time for burst portion to return
Tst: start time for bursting tasks
Obr: output to input ratio of MR job on burst portion
A: number of slots allocated in IC
MIC: number of maps executed in IC
MEC: number of maps burst to EC
Cm: number of maps completed before start of shuffle phase

Our model assumes that resources in IC are homogeneous and there is no map/reduce skew. These assumptions imply that map (or reduce) tasks executing in parallel on different nodes will complete at the same time. Since the focus of this work is to demonstrate how a hybrid model of batch processing in IC and stream processing in EC can reduce overall job completion time, we make the above assumptions to simplify the estimation of job completion time in IC. Handling skew and heterogeneity is part of our future work. We also assume that the reduce operation is associative. This allows incremental processing of data in the reduce tasks in EC. If the reduce is non-associative, then only maps can be run in EC. A majority of applications currently modeled in MapReduce have associative reduce operations.
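The role of associativity can be seen in a small example: partial word counts computed in EC over the data streamed so far can later be merged with partial counts from IC, yielding the same result a single reducer would produce over all the data (illustrative Python sketch; the helper names are ours):

def reduce_partial(pairs):
    # Associative reduce (a sum here) applied to whatever data has arrived so far.
    out = {}
    for key, value in pairs:
        out[key] = out.get(key, 0) + value
    return out

def merge_partials(a, b):
    # Associativity lets two partial results be combined into the exact result
    # that a single reduce over the union of the inputs would have produced.
    merged = dict(a)
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged

ec_partial = reduce_partial([("x", 1), ("y", 2)])  # burst portion, reduced in EC
ic_partial = reduce_partial([("x", 3)])            # remaining portion, reduced in IC
print(merge_partials(ec_partial, ic_partial))      # {'x': 4, 'y': 2}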

The model estimates the allocation (A) for the submitted job to meet its deadline. As shown in Fig. 3, the start of the reducers is delayed until Cm maps are complete. This allows a major portion of the shuffle phase to be overlapped with the map phase. At the same time, it prevents idle time slots within a reducer (due to an early start) during the shuffle phase. We consider the case where the number of resources that can be allocated in IC is greater than the number of reducers (R), so that all the reducers can complete in a single wave6.

When Cm maps have completed, the amount of data produced from the mappers to be transferred to a single reducer (Da) is given by

Da = Cm · k    (1)

Once the reducers start, the number of slots available for the execution of maps is A − R. Thus, at any time t starting at this point, the amount of data available for the shuffle/sort per reduce is given by

D(t) = Da + ((A − R) · k / mt) · t    (2)

where (A − R)/mt is the rate at which maps complete.

6 This is typically done in MapReduce to overlap shuffle with map phase.


[Fig. 3: job execution timeline showing the map phase on A slots, with Cm maps completed before the shuffle/sort starts, the reduce phase on R slots, and the intervals Tbs, Ts, Tr, Tst and Tbr.]

Fig. 3. Job Execution Timeline

To ensure total overlap between the shuffle and map phases, the data that can be shuffled till the end of the map phase (within duration Ts) must be equal to the total size of the map output:

D(Ts) = α · Ts    (3)

The shuffle duration Ts can also be expressed in terms of the execution time of the left-over maps at the beginning of the shuffle phase. Thus,

Ts = (M − Cm) · mt / (A − R)    (4)

Substituting the value of Eq. (2) in Eq. (3), we get

Da + ((A − R) / mt) · k · Ts = α · Ts    (5)

Simplifying Eq. (5) using Eq. (4) and Eq. (1), we get the value of Cm as

Cm = M · [1 − k · (A − R) / (α · mt)]    (6)

Referring to Fig. 3, the total time (T) taken by the job to finish is given by

T = Tbs + Ts + Tr    (7)

where Tbs = Cm · mt / A. Since there is a single reduce phase, we have Tr = rt.

Substituting the values from the above equations and simplifying Eq. (7), we get

A = ((α · mt + R · k) · M) / (α · (T − rt))    (8)

Here k refers to the size of the data output per map to a reduce task. Thus, k can be expressed in terms of the map selectivity as k = msel · B / R. Similarly, the processing time per reduce task can be expressed as rt = β · M · msel · B / R, where β refers to the rate at which data is processed within a reduce task.

Suppose the number of resources required to meet the deadline within IC (A) exceeds the available resources in IC (A′); then we need to use resources in EC to meet the job deadline. With A′ resources in IC, the number of maps M′ that can be executed internally before the deadline T is given by

M′ = A′ · α · (T − rt) / (α · mt + R · k)    (9)

If the remaining maps (M − M′) can be burst such that the burst job completes within the slack time (T − rt), then the deadline can be met. Therefore, we consider MEC = M − M′ while estimating the time for the burst job to complete (Tbr). We assume that the output to input ratio (Obr) when MapReduce is run on the burst portion is equal to msel · rsel. This assumption holds when the number of keys in the data set is huge.

Using the checkpointing schemes discussed in Section VI-E, output transfer can be overlapped with the input/processing phase in Storm. Denoting the degree of overlap by γ, the value of Tbr is given by

Tbr = (MEC · B / bEC) · [1 + (1 − γ) · Obr]    (10)

The latest time by which bursting must start (Tst) is given by

Tst = Tbs + Ts − Tbr = T − rt − Tbr    (11)

If the deadline cannot be met by bursting, then we determine the earliest finish time for the job. Substituting Tst = 0 in Eq. (11), we have Tbr = T − rt. Using MEC = M − MIC in Eq. (10) and substituting the value of Tbr from Eq. (10) in Eq. (9), we get

MIC = (M · α · A · B · Xterm) / ((α · mt + R · k) · bEC + α · A · B · Xterm)    (12)

where Xterm = 1 + (1 − γ) · Obr.

With only MIC out of M maps executing in IC, the shuffle data size in IC is reduced. Thus, we calculate the start of the reducers considering only the shuffle data size generated by the MIC maps: the value of Cm (to completely overlap the map and shuffle phases) is calculated by substituting M = MIC in Eq. (6). The fraction of maps (Cm/M) that must be completed before starting the reducers is known as the reduce slowstart. It is a configuration parameter in Hadoop to control the start of the reducers. We illustrate the effect of reduce slowstart on the job completion time in Section VII-B1.
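Putting the model together, the estimator can be sketched as a small Python routine combining Eqs. (6) and (8)-(13). This is only an illustration under the assumption of consistent units (MB, MB/s, seconds); it ignores the rounding of task counts, and the variable names are ours rather than anything from the BStream implementation:

import math

def plan_allocation(M, R, T, b_ec, B, mt, beta, m_sel, r_sel, alpha,
                    gamma, slots_ic, p_delay=0.0):
    # Profile-derived quantities (Section V): per-reduce map output and reduce time.
    k = m_sel * B / R
    rt = beta * M * m_sel * B / R

    # Eq. (8): slots needed in IC to finish by the deadline without bursting.
    A = (alpha * mt + R * k) * M / (alpha * (T - rt))
    if A <= slots_ic:
        Cm = M * (1 - k * (A - R) / (alpha * mt))            # Eq. (6)
        return {"slots": math.ceil(A), "maps_burst": 0,
                "slowstart": Cm / M, "burst_start": None}

    # Eq. (9): maps the available IC slots can finish before the deadline.
    M_ic = slots_ic * alpha * (T - rt) / (alpha * mt + R * k)
    M_ec = M - M_ic
    O_br = m_sel * r_sel                                     # output/input ratio of burst portion

    # Eq. (13): burst completion time (upload, un-overlapped download, pdelay term).
    T_br = (M_ec * B / b_ec) * (1 + (1 - gamma) * O_br + b_ec * p_delay)
    T_st = T - rt - T_br                                     # Eq. (11): latest burst start time
    Cm = M_ic * (1 - k * (slots_ic - R) / (alpha * mt))      # Eq. (6) with M = MIC
    return {"slots": slots_ic, "maps_burst": math.ceil(M_ec),
            "slowstart": max(0.0, Cm / M), "burst_start": T_st}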

VI. BSTREAM

We first present an overview of the cloud bursting framework and then describe the important components in detail.

A. System Architecture

BStream couples stream processing (using Storm) in EC with batch processing (using Hadoop) in IC to realize inter-cloud MapReduce. Since the input data is already loaded in HDFS, batch processing is more efficient in IC. Stream processing is preferred in EC as it can perform computation on the incoming data as and when it arrives.

Fig. 4 shows the system architecture. The user submits the MapReduce job to the controller along with its deadline (step 1). The controller uses the estimator to determine the resource allocation (step 2). The estimator refers to the job profile database and uses the analytical model (refer Section V) to estimate the resource allocation in IC (step 3). If the estimated resource allocation returned to the controller by the estimator (step 4) exceeds the available resources in IC, then the controller initializes the burst coordinator (step 5).


[Fig. 4: BStream architecture. Hadoop (batch processing) runs in the internal cloud and Storm (stream processing) runs in the external cloud; data nodes in IC burst file-splits to a Kafka message queue in EC, spouts feed map bolts and reduce bolts, and the checkpoint controller ships reduce output from LevelDB back to the Hadoop reducers in IC. Numbered arrows mark the control signals and data transfers (steps 1-10) described in the text. Legend: AM: Application Master, M: Mapper, R: Reducer, N: Nimbus, S: Spout, Ck: Check-point controller, Z: Zookeeper, LevelDB: Hash structure, DN: Data Node, RM: Resource Manager, C: Controller, BC: Burst coordinator, E: Estimator, DB: Profile Database.]

Fig. 4. BStream: Cloud Bursting Framework for MapReduce

The controller passes the values (computed by the estimator) for the number of maps to be burst and the time to start bursting to the burst coordinator. The controller submits the MapReduce job to the Hadoop framework (step 6). The burst coordinator submits the Storm topology to Nimbus (step 7). It sequentially sends a burst control signal for each data block to one of the data nodes (step 8a). The data node starts bursting data to the Kafka7 message queue (step 8b).

The Storm spouts start consuming messages from Kafka in parallel (step 8c). These messages (tuples) are processed through the Storm topology and the output is produced from the reduce bolt during commit8. The burst coordinator simultaneously updates the status of the data transfer in Zookeeper (step 8d), which is referred to by Storm to detect the end of the data transfer (step 8e). The output from the reduce bolt is written to LevelDB9 in EC (step 9a). Two different checkpointing strategies (discussed in Section VI-E) are proposed to transfer this output to the reducers in IC (step 9b). Zookeeper is also used to store the reducer configuration in IC. This is used by the checkpoint controller to transfer the Storm output to the appropriate reducer nodes in IC (step 9c). The Hadoop AM is responsible for updating Zookeeper with reducer configuration changes (step 9d). These updates are consumed by the checkpoint controller, which uses them for subsequent transfers from EC (step 9e). The reducers in IC merge the output from Storm and Hadoop to produce the final job output (step 10).

B. Profiling and Estimation

Profiling builds a job profile consisting of the parameters discussed in Section V.

7 http://incubator.apache.org/kafka/
8 In Storm, the map bolts (MB) are specified as regular bolts, while the reducer bolts (RB) are specified as committer bolts. This sets the job to make use of the transactional processing semantics.
9 http://code.google.com/p/leveldb/

[Fig. 5: processing time in EC versus the number of Storm workers (3 to 6) for per-map computation times mt = 50s, 100s and 200s, with reference lines for the upload time at 5MBps and 10MBps inter-cloud bandwidth and the resulting processing delay.]

Fig. 5. Profiling of Storm Job

The authors of [13] have shown that profile parameters such as map execution time, map selectivity and output to input ratio remain consistent across recurring datasets from the same data source (e.g. Wikipedia article traffic logs of different months have similar profiles). Therefore, we do offline profiling once for every unique job type (e.g. WordCount, InvertedIndex) on each data source. We run the job in IC with reduce slowstart set to 1 and measure the profile parameters. Setting the reduce slowstart to 1 ensures that the reducers start only after all the maps are complete. Therefore, there is no waiting time within a reducer slot during the shuffle phase. Thus, the measured profile parameters are free from any extraneous delays.

Storm job profiling determines the minimum number of workers to be provisioned in EC such that the data processing rate matches the upload rate. As the number of Storm workers increases, the processing time of a job decreases due to increased parallelism. However, the processing rate is limited by the upload rate to EC. If the number of workers required to match the processing rate with the arrival rate exceeds the maximum number of workers in EC, then an additional processing delay is incurred. Incorporating this delay in our analytical model, Eq. (10) becomes

Tbr = (MEC · B / bEC) · [1 + (1 − γ) · Obr + bEC · pdelay]    (13)

where pdelay = 1 / (processing rate − upload rate).

Using the updated value of Tbr, the value of Xterm in Eq. (12) becomes [1 + (1 − γ) · Obr + bEC · pdelay]. We use these updated equations in our experiments.

Fig. 5 illustrates Storm job profiling for WordCount jobs on a 1GB synthetic dataset with varying per-map computation times (mt). At 5MBps, uploading 1GB of data takes 204 seconds. We observe that for 50s and 100s computations, the processing time is greater than the upload time with 3 Storm workers, while it is lower with 4 Storm workers. Thus, we choose 4 Storm workers to match the processing rate with the upload rate. Similarly, for the 200s computation, we choose 6 workers at 5MBps inter-cloud bandwidth. At 10MBps, the processing time in EC is greater than the upload time even with 6 Storm workers. Therefore, we incorporate the processing delay (pdelay) in our estimation.
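A minimal sketch of this provisioning step is shown below (Python). The per-worker processing rate is an illustrative stand-in for the profiled value and the helper name is ours; when even the maximum number of workers cannot keep up, the caller falls back to adding the pdelay term of Eq. (13):

def provision_workers(upload_rate_mbps, per_worker_rate_mbps, max_workers):
    # Return (workers, needs_pdelay): the smallest worker count whose aggregate
    # processing rate keeps up with the upload rate, or the maximum worker
    # count plus a flag saying the pdelay term of Eq. (13) must be applied.
    for workers in range(1, max_workers + 1):
        if workers * per_worker_rate_mbps >= upload_rate_mbps:
            return workers, False
    return max_workers, True

# Roughly mirroring Fig. 5 with an assumed ~1.4 MB/s per worker and 6 workers:
print(provision_workers(5.0, 1.4, 6))   # (4, False) -> 4 workers suffice at 5MBps
print(provision_workers(10.0, 1.4, 6))  # (6, True)  -> pdelay needed at 10MBps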

When a job is submitted to the framework, the estimator retrieves the profile parameters from the profile database and uses the analytical model to determine the resource allocation plan. This corresponds to the output parameters discussed in Section V. The controller takes responsibility for executing the job according to the resource allocation plan.

[Fig. 6: completion time (in seconds) of the WordCount job under default Hadoop, Hadoop with even distribution of reducers, and even distribution combined with the predicted reduce slowstart.]

Fig. 6. Optimizations in Hadoop Reduce

C. Modifications to Hadoop

We modified YARN to burst map tasks to EC and to combine the output from EC in the Hadoop reducers.

Bursting map tasks: When a job is submitted, the Hadoop runtime registers the file-split information of that job with the burst coordinator. The burst coordinator uses this information to select unscheduled file-splits for bursting to EC. The file-splits are transferred as 256KB messages to the Kafka message queue on EC. To ensure that the burst maps are not processed internally, each Hadoop mapper task queries the burst coordinator about the status of its file-split. If the file-split has been burst, then the map task returns immediately.

Combining the EC output within the Hadoop reducer: As shown in Fig. 4, the outputs from EC are directly transferred to the reducer nodes in IC. This avoids additional intra-cluster latency within IC. The burst coordinator stores in Zookeeper the initial set of nodes where the Hadoop reducers are likely to run. The AM tries to maintain this reducer configuration by specifying hints while requesting containers from the RM. If the location of a reducer changes, then the AM updates Zookeeper. The checkpoint controller on each Storm reducer node refers to this configuration before transmitting the checkpoint data. If the reducer configuration changes, the new reducer reads the missed checkpoint data from HDFS10. Each Hadoop reducer node runs a separate merger thread that merges the checkpoint data (as it arrives) with the previous output.

Reduce phase optimizations11: In Hadoop, the reduce tasks can be assigned to any free container, resulting in an uneven distribution among the nodes. If many reduce tasks run concurrently on the same node, they increase the I/O contention and slow down performance. Therefore, we incorporated a new allocation scheme that distributes the reducers evenly across different nodes. The default reduce slowstart in Hadoop is 0.05. For applications having a low shuffle data size, the default slowstart results in idle time within reducer slots. To minimize this, we calculate the reduce slowstart value using the analytical model. Fig. 6 shows the completion time of the WordCount application on a 4GB Wikipedia dataset using the above-mentioned optimizations.

10 For fault-tolerance, the receiving node replicates checkpoint data on HDFS.
11 These optimizations are applicable to any Hadoop job (not restricted to cloud bursting).

[Fig. 7: checkpointing workflow. Partial outputs from the Storm reducer bolts are stored in LevelDB in the external cloud; the checkpoint controller initiates a checkpoint, checks each key for updates, removes keys with updated values from the checkpoint, and sends the updated checkpoint to the Hadoop reducers in the internal cloud.]

Fig. 7. Checkpointing Strategy

We observe that even distribution of the reducers improves performance by 10% over default Hadoop. This, together with reduce slowstart prediction, improves performance by 30% over default Hadoop. More details on reduce slowstart are presented in Section VII-B1.

D. Modifications to Storm

We modified Storm to implement the notion of job completion. The end of the stream is indicated by a state change in Zookeeper. The burst coordinator is responsible for updating the state changes. It is also responsible for killing the Storm topology once the output is downloaded from EC.

E. Checkpointing

The time spent in downloading the output from EC can contribute a significant portion to the overall completion time of the burst job if the output to input ratio is high. With stream processing in EC, partial outputs are available at the reducers as soon as some data is processed through the Storm topology. We use checkpointing to overlap output transfer with input transfer/processing, thus minimizing the overall completion time of the job.

In checkpointing, we take a snapshot of the current output of a reducer and transfer the snapshot data to IC. Fig. 7 presents the checkpointing strategy in detail. The partial outputs from the reducer are stored in LevelDB. The data output rate from the Storm reducers is higher than the rate at which data can be downloaded. Therefore, a key (present in the output snapshot) can get updates from the Storm reducer during the checkpoint transfer. Such an updated key will require retransmission in the next checkpoint. To reduce the probability of retransmission, the checkpoint controller employs a continuous updating strategy, where it checks for updates before sending a key to the reducers in IC. If a key has received updates, then that key is not transmitted as part of the current checkpoint, thus reducing the retransmission overhead. If the key does not receive any further updates, it will be transmitted as part of the next checkpoint. Since the network is kept busy throughout, there is no loss in terms of bandwidth utilization. If a certain key gets updated after it has been transmitted, then that key and the difference between its output value and the checkpoint value are transmitted as part of the next checkpoint. The next checkpoint also includes keys that newly arrived in the interval between the two checkpoints.


Depending on the characteristics of the dataset and the size of data processed so far, the probability of updates or the addition of new keys keeps varying. We propose the following two schemes for checkpointing.

In scheme 1, the first checkpoint is initiated when the download time for the partial outputs stored in LevelDB equals the remaining execution time of the Storm job. Any missed updates or new keys are transferred in the second checkpoint. Thus, there are only two checkpoints in this scheme. In scheme 2, the first checkpoint is initiated when the time for downloading the entire output of the Storm job equals the remaining execution time of the job. Subsequent checkpoints are issued immediately after the completion of the earlier checkpoint. Thus, there can be more than two checkpoints in this scheme. In both schemes, the total size of the checkpoint transfer can be at most twice the total output size of the job. However, the first checkpoint generally occurs much earlier in scheme 2 than in scheme 1.
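The key-skipping logic of the continuous updating strategy can be sketched as follows (illustrative Python: plain dictionaries stand in for LevelDB, values are assumed to be numeric partial aggregates, and the function name is ours, not BStream's):

def take_checkpoint(live_state, snapshot, already_sent):
    # live_state:   current per-key reduce output (LevelDB in the real system)
    # snapshot:     per-key values frozen when this checkpoint was initiated
    # already_sent: value already shipped to IC for each key in earlier checkpoints
    to_send = {}
    for key, snap_value in snapshot.items():
        if live_state.get(key) != snap_value:
            # Key was updated during the transfer window: defer it to the
            # next checkpoint instead of retransmitting it now.
            continue
        delta = snap_value - already_sent.get(key, 0)
        if delta:
            to_send[key] = delta           # only the difference is shipped
            already_sent[key] = snap_value
    return to_send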

Once the checkpoint data is received at IC, it needs to be merged on the Hadoop reducer nodes with the shuffle/sort data generated from the maps running internally. The data within each checkpoint is already sorted, as LevelDB maintains the keys in sorted order. Thus, the merge operation in IC is analogous to the merge phase of the merge-sort algorithm.
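Since every incoming run is already key-sorted, the merger thread on a Hadoop reducer node only needs a streaming multi-way merge that applies the (associative) reduce to equal keys, as in this sketch (Python; the sum stands in for the job's reduce function):

import heapq
from itertools import groupby
from operator import itemgetter

def merge_sorted_runs(*runs):
    # Merge key-sorted (key, value) runs -- checkpoints from EC plus the
    # shuffled/sorted map output produced in IC -- like the merge phase of
    # merge-sort, reducing the values of equal keys as we go.
    merged = heapq.merge(*runs, key=itemgetter(0))
    return [(key, sum(value for _, value in group))
            for key, group in groupby(merged, key=itemgetter(0))]

checkpoint_1 = [("apple", 4), ("cat", 1)]   # from EC, already sorted by LevelDB
checkpoint_2 = [("apple", 2), ("dog", 3)]
ic_shuffle   = [("cat", 5), ("dog", 1)]     # from the maps run in IC
print(merge_sorted_runs(checkpoint_1, checkpoint_2, ic_shuffle))
# [('apple', 6), ('cat', 6), ('dog', 4)]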

In general, the checkpoint process starts earlier in scheme 2 than in scheme 1. Thus, it is expected to achieve a better degree of overlap of output transfer with input transfer/processing. Scheme 1 does not assume anything about the output to input ratio of the arriving job. Therefore, it can adapt to changes in the arriving job profile with respect to the output to input ratio. Scheme 2 assumes that the output to input ratio of the arriving job exactly matches the one stored in the profile database. In case of a mismatch, scheme 2 can result in a lower degree of overlap or too much extra download, affecting performance. Section VII-B3 presents the performance evaluation of the checkpointing schemes.

F. Fault-tolerance

Fault-tolerance mechanisms must ensure reliable processing of data in the event of node/task failures in Storm/Hadoop. Since the inter-cloud bandwidth is low, we must also avoid retransmission of input data from IC to EC or output data from EC to IC to prevent performance overheads.

Reliable processing of the non-burst portion of the input is ensured by Hadoop's fault-tolerance mechanisms. If data from IC is directly streamed to the Storm nodes in EC, then it must be retransmitted from IC when any of the Storm tasks fail. Therefore, we first store the uploaded data in distributed Kafka servers in EC. These Kafka nodes act as the sources of data for the Storm workers. In the event of a worker failure, data is read from the Kafka servers in EC, thus avoiding retransmission from IC. Currently, we do not handle failure of the Kafka servers. Exactly-once processing of the incoming data stream in Storm is handled using transactional topologies. Finally, to avoid retransmission of checkpoint data from EC to IC, the IC node that receives the data replicates it to HDFS. Since the internal bandwidth is an order of magnitude faster than the inter-cloud bandwidth, reading from HDFS does not result in additional delays during node failures.

Fig. 8. Effect of Reduce Slowstart on Job Completion Time - job completion time (seconds) vs. the reduce slowstart fraction, marking Hadoop's default value (0.05) and the value predicted by the model.

VII. PERFORMANCE EVALUATION

We first perform a standalone evaluation of BStream to validate the analytical model, demonstrate meeting job deadlines, and quantify the benefits of the checkpointing schemes; we then compare its performance with related works.

A. Experimental Setup

Our experimental setup consists of two clouds, IC and EC. IC consists of 21 nodes running Hadoop 0.23.0, with 20 slave nodes and 1 master. Each node has two AMD Opteron dual-core processors running at 2GHz, 4GB RAM, and a 500GB SATA disk drive. The HDFS block size is set to 64MB and each node is configured to run at most 2 tasks (map or reduce) concurrently. The nodes are organized in 2 racks interconnected by a 1Gbps LAN. EC consists of 12 nodes, each with a 3.06GHz Intel quad-core processor and 8GB RAM, interconnected by a 1Gbps LAN. The EC nodes run Storm version 0.8.1 with 1 worker per node.

B. Stand-alone evaluation of BStream

This section is organized as follows. We first show results for reduce slowstart estimation. Next we show how the analytical model is used to meet job deadlines. This is followed by evaluation of checkpointing schemes in BStream. Finally, we present an overall summary of BStream model validation.

1) Reduce slowstart: Reduce slowstart corresponds to the fraction of maps that must be completed before the Hadoop runtime launches the reduce tasks. Setting this value appropriately has significant benefits in YARN, as it no longer maintains a hard partitioning of map and reduce slots; all the slots in the cluster can be used to run map tasks until the reducers start. Fig. 8 shows how the job completion time varies with the reduce slowstart parameter for the WordCount application on the 16GB Wikipedia dataset. The number of reduce tasks is 12 and the total number of slots allocated to the application is 28. We observe a V-shaped curve with a minimum (679s) at 0.8. The analytical model predicts a reduce slowstart of 0.91, for which the completion time is 687s. Thus, the analytical model predicts very close to the minimum. The completion time for Hadoop's default slowstart value (0.05) is much higher (958s). Thus, specifying reduce slowstart appropriately considerably reduces the job completion time. The importance of reduce slowstart is further illustrated in Section VII-C.
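For reference, the slowstart fraction is exposed as a job configuration knob in Hadoop; a minimal sketch of plugging in a model-derived value (0.91 in the experiment above) might look as follows. The property name shown is the YARN-era one (mapreduce.job.reduce.slowstart.completedmaps); older Hadoop 1.x releases use mapred.reduce.slowstart.completed.maps instead. In BStream the value would come from the analytical model rather than being hard-coded.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: apply a model-estimated reduce slowstart before submitting a job.
public class SlowstartConfig {

    public static Configuration withSlowstart(float modelEstimate) {
        Configuration conf = new Configuration();
        // Fraction of completed maps required before reducers are scheduled.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", modelEstimate);
        return conf;
    }

    public static void main(String[] args) {
        Configuration conf = withSlowstart(0.91f); // value estimated by the model above
        System.out.println(conf.get("mapreduce.job.reduce.slowstart.completedmaps"));
    }
}
```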

2) Meeting deadlines: We describe in detail three experiments for meeting job deadlines using the WordCount application on the 32GB Wikipedia dataset. The maximum number of slots in IC is 28, and the inter-cloud bandwidth is set to 5MBps.


Fig. 9. Resource Provisioning for Meeting Deadlines. (a) Observed job duration vs. deadline (seconds) for the three runs. (b) Runtime parameters estimated by the model:

Scenario | Deadline (s) | Slots in IC | Maps bursted | Reduce slowstart | Burst start (s)
No Bursting | 1500 | 25 | 0 | 0.98 | 0
Bursting at 5MBps | 1300 | 28 | 50 | 0.87 | 522
Bursting at 10MBps | 1100 | 28 | 128 | 0.88 | 163

Fig. 9a shows the observed job completion time and the corresponding deadline for the three runs, and Fig. 9b shows the runtime parameters estimated by the model. For the first case, with a deadline of 1500s, the model determines that no bursting is required to meet the deadline. The required number of slots in IC is estimated as 25 and the reduce slowstart as 0.98. With this configuration, the job completes in 1436s, thereby meeting the deadline. In the second case, the deadline is reduced to 1300s. The model determines that bursting is required to meet the job deadline; it estimates the number of maps to be burst as 50 and the reduce slowstart as 0.87. Following the design principle of "using EC only when necessary", the model uses the maximum number of slots available in IC (28 slots) and delays the start of bursting to 522s. The job completes in 1262s. In the third case, the deadline is further reduced to 1100s. As this deadline cannot be met with an inter-cloud bandwidth of 5MBps, we increase the bandwidth to 10MBps. The job completes in 1088s with 128 maps burst and a reduce slowstart of 0.88.

BStream tries to use minimum resources to meet the job deadline. In the above experiments, we observe that the job completion time using BStream is quite close to the deadline. We also observe that the burst job's results always returned before the reducers in IC completed shuffling the data; the maximum observed time gap is very small (≈50s). This suggests that the BStream analytical model is able to correctly estimate resource provisioning in EC. More results validating the analytical model are presented in Section VII-B4.

3) Checkpointing schemes: As the output to input ratio increases, the overhead of downloading output at the end of the computation becomes higher. For example, a burst job whose output to input ratio is 0.8 has a download overhead of about 40% of the total completion time. BStream uses checkpointing schemes to overlap download with processing/upload.

The following metrics are used to compare the checkpointing schemes with the FinalUpdate (FU) scheme, in which output is downloaded from EC only at the end of the computation (a sketch of how they can be computed follows the list):

• Degree of overlap (dov): the percentage overlap of output transfer with input transfer/processing.

• Extra network cost (enc): the percentage of extra data downloaded by checkpointing due to retransmission of keys that received updates after the previous checkpoint.

• Reduction in completion time (rdc): the percentage improvement in job completion time over the FU scheme.
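One plausible way to compute the three metrics from per-run logs is sketched below. The exact quantities logged by the evaluation scripts are not given in the text, so the arguments and formulas are assumptions chosen to match the verbal definitions above.

```java
// Plausible formulas for dov, enc and rdc, assuming the per-run quantities below are
// logged. These are illustrative reconstructions, not the paper's evaluation scripts.
public class CheckpointMetrics {

    /** dov: share of output-transfer time that overlapped input transfer/processing. */
    public static double degreeOfOverlap(double downloadSecsTotal, double downloadSecsOverlapped) {
        return 100.0 * downloadSecsOverlapped / downloadSecsTotal;
    }

    /** enc: extra bytes downloaded because already-sent keys were updated and re-sent. */
    public static double extraNetworkCost(long totalOutputBytes, long retransmittedBytes) {
        return 100.0 * retransmittedBytes / totalOutputBytes;
    }

    /** rdc: improvement of checkpointed completion time over the FinalUpdate baseline. */
    public static double reductionInCompletionTime(double finalUpdateSecs, double checkpointedSecs) {
        return 100.0 * (finalUpdateSecs - checkpointedSecs) / finalUpdateSecs;
    }
}
```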

A high output to input ratio can arise due to two reasons:

1) The number of unique keys in the input data is large. Therefore, even after the reduce operation (which aggregates the values for a given key) in EC, the output size is significant.

2) The number of unique keys in the input data is small but the output value for each key is huge (e.g. InvertedIndex). The output value for each key (word) consists of a list of files and the corresponding count within each file where the word appeared.

TABLE I
SCHEME 1 VS SCHEME 2: VARYING OUTPUT TO INPUT RATIO

otoi | dov S1 (%) | dov S2 (%) | enc S1 (%) | enc S2 (%) | rdc S1 (%) | rdc S2 (%)
0.2 | 36.5 | 50.0 | 0.7 | 2.0 | 7.4 | 10.2
0.4 | 28.9 | 47.8 | 10.2 | 14.0 | 8.8 | 14.6
0.6 | 43.5 | 54.3 | 5.9 | 20.7 | 17.5 | 21.9
0.8 | 43.7 | 58.9 | 3.4 | 11.4 | 21.1 | 28.4

We use the WordCount application to illustrate case 1 and InvertedIndex to illustrate case 2. We generate datasets with different output to input ratios using a uniform distribution; the size of each key (word) in the dataset is 32 bytes. We set up a server in IC that reads these datasets from a local file and uploads them to the Kafka cluster in EC, and we simultaneously activate the Storm topology. We log the time and size of each checkpoint produced during execution. The checkpoints are transferred to IC using the emulated network bandwidth. Finally, we record the end of the experiment when the last checkpoint is completely transferred to IC. We present performance results for a 1GB dataset; similar results were observed for larger datasets.

Case 1 − Checkpoint performance using WordCount:

Varying output to input ratio: Table I shows the degree of overlap, extra network cost and reduction in completion time for the checkpointing schemes. We observe that both the degree of overlap and the reduction in job completion time increase as the output to input ratio increases. The degree of overlap is highest (around 59%) when the output to input ratio is 0.8. A higher degree of overlap results in a greater reduction in job completion time. For otoi = 0.8, checkpointing using scheme S2 lowers job completion time by 28% while incurring an extra network cost of only 11%. Thus, the checkpointing schemes are greatly beneficial at higher output to input ratios.

Continuous updates during checkpoint: In Section VI-E, we postulated that, since the rate at which output is produced by Storm is much greater than the rate at which data can be downloaded, a continuous update strategy would reduce the probability of retransmission and thereby decrease the extra network cost.


TABLE II
CONTINUOUS UPDATION DURING CHECKPOINT TRANSFER (otoi = 0.8)

Metric | S1 with updates | S1 without | S2 with updates | S2 without
dov (%) | 43.7 | 36.3 | 58.9 | 52.1
enc (%) | 3.4 | 43.7 | 11.4 | 45.7
rdc (%) | 21.0 | 17.5 | 28.4 | 25.1

TABLE III
SCHEME 1 VS SCHEME 2: INVERTED INDEX APPLICATION (otoi = 1.42)

Metric | S1 | S2
dov (%) | 7.4 | 30
enc (%) | 1.4 | 17
rdc (%) | 4.4 | 18

TABLE IV
PARAMETERS CONSIDERED FOR MODEL VALIDATION

Parameter | Values
Job category | WordCount; modified WordCount with 50s, 100s, 150s, 200s map execution times
Inter-cloud bandwidth | 5MBps, 10MBps
Parallel execution slots in IC | 20, 28
Output to input ratio | 0.02, 0.7

We now experimentally evaluate the benefits of the continuous update strategy during checkpointing. Table II shows the performance of the checkpointing schemes with and without continuous updates for a dataset whose output to input ratio is 0.8. We observe that continuous updating considerably reduces the extra network cost (by ≈40% for S1 and 35% for S2). As the extra network cost is reduced greatly, the total time for downloading checkpoint data decreases, which in turn improves the degree of overlap and reduces the job completion time. Thus, the continuous update strategy enhances the performance of checkpointing.

Case 2 − Checkpoint performance using InvertedIndex:

In InvertedIndex, when the Storm reducer encounters the same word again, it appends the file name and the count within that file to the output string for that key. Thus, at each occurrence of the same word, the output for that key grows in size. This is not the case for WordCount, where only the count gets incremented. With checkpointing, a disjoint list of filenames and counts is transmitted in each checkpoint. Because there is no overlap across checkpoint output strings for the same key, checkpointing is expected to incur low extra network cost while improving the degree of overlap and completion time. Results validating this are shown in Table III. Further, we see that S2 performs almost 4 times better than S1 in reducing the job completion time, proving to be more suitable for applications such as InvertedIndex, where incremental reduce appends new values to the key's output.
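The contrast between the two cases can be made concrete with a sketch of incremental reduce state for InvertedIndex: each new occurrence of a word appends a "file:count" posting, and a checkpoint ships only the postings added since the previous checkpoint, which is why per-key checkpoint payloads are disjoint. The class below is illustrative and is not the actual Storm bolt used in BStream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative incremental reduce state for InvertedIndex: each word accumulates
// "file:count" postings, and a checkpoint drains only postings added since the last
// checkpoint (unlike WordCount, where a re-sent key carries an updated count).
public class InvertedIndexState {

    private final Map<String, List<String>> pendingPostings = new HashMap<>();

    /** Record that 'word' appeared 'count' times in 'file' (produced by the map side). */
    public void reduce(String word, String file, long count) {
        pendingPostings.computeIfAbsent(word, k -> new ArrayList<>())
                       .add(file + ":" + count);
    }

    /** Drain the postings accumulated since the previous checkpoint; the returned
     *  map never overlaps with what earlier checkpoints already shipped. */
    public Map<String, List<String>> drainForCheckpoint() {
        Map<String, List<String>> snapshot = new HashMap<>(pendingPostings);
        pendingPostings.clear();
        return snapshot;
    }
}
```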

From Table I, we observe that S2 achieves a higher degree of overlap than S1 across the different output to input ratios, which also results in a greater reduction in completion time for S2. Even though the extra network cost for S2 is greater than for S1, its absolute values remain below 30%. As the speedup obtained using S2 is better than that of S1, we use S2 for the rest of the experiments.

4) Summary of model validation: Table IV summarizes the different parameters considered for BStream validation. We observed a strong correlation (R² = 0.98) between the estimated job completion time and the actual completion time (refer to Fig. 10).

Fig. 10. Model Validation - measured completion time vs. estimated completion time (seconds), with a linear fit of slope 0.99 and adjusted R² of 0.985.

Also, the linear fit to the data has a slope of 0.99, indicating that the absolute values of the estimated job completion time are close to the actual completion time. Thus, we conclude that the analytical model provides a reasonably accurate prediction of job completion time.

C. Comparison of BStream with Related Works

Section III-B showed that a batch implementation of Hadoop in a multi-cloud setting incurs huge startup latencies for map computation in EC. However, if maps in EC can perform streaming reads from HDFS in IC, the startup latencies can be reduced [7]–[9]. In [8], the authors proposed map-aware push and shuffle-aware map optimizations to decide how much (the number of maps) to burst. These techniques ensure that shuffling output from burst maps completes just-in-time with the end of the shuffle phase in IC; thus, they hide the high latency of downloading map output at Internet bandwidth. We implemented streaming reads with the map-aware push and shuffle-aware map techniques in YARN for comparison with BStream; henceforth, we refer to that implementation as MaSa. We also implemented the task stealing approach proposed in [18]; henceforth, we refer to that implementation as TaskSteal.

1) Job characteristics: We consider a diverse set of jobs with different computation and data characteristics for our experiments (refer to Fig. 11). WordCount (WC), MultiWordCount (MWC) and InvertedIndex (InvIndex) have the same input, the 32GB Wikipedia dataset, but their output sizes vary: very low, moderate and high, respectively. Sort, WC100 and WC200 are run on a 24GB synthetic dataset. In Sort, the mappers just partition the keys among reducers while the actual sorting is done by the reducers; hence, the computation requirement for each mapper is low, but the output to input ratio is high (0.74). WC100 and WC200 share similar output characteristics with Sort, but their computation requirements at the mapper are high (mt = 100s) and very high (mt = 200s), respectively. In all these jobs, the reduce computation required is small compared to the map phase computation. We discuss extensions to the model required to handle jobs with significant reduce computation in Section IX.

2) Implementation Methodology: To realize MaSa, we implemented the map-aware push and shuffle-aware map techniques as part of the Hadoop scheduler.


Fig. 11. Job Characteristics - input size and data source for each job, together with its per-map execution time (mt) and output to input ratio:

Job | Input size (GB) | Input data | mt (s) | Output to input ratio
WordCount (WC) | 32 | Wiki | 69 | 0.06
MultiWordCount (MWC) | 32 | Wiki | 95 | 0.48
InvertedIndex (InvIndex) | 32 | Wiki | 78 | 0.7
Sort | 24 | Synthetic | 26 | 0.74
WC100 | 24 | Synthetic | 100 | ≈0.74
WC200 | 24 | Synthetic | 200 | ≈0.74

Fig. 12. Speedup Comparison of Related Works based on PrevGen with YARN (in IC-only setting) - speedup/slowdown (%) of TaskSteal, MaSa and YARN relative to Hadoop PrevGen; YARN with optimized reduce slowstart outperforms the other related works running in both clouds.

We set up a single Hadoop cluster spanning IC and EC. The nodes in IC act as both data and compute nodes, whereas the nodes in EC are configured as compute nodes only. This ensures that the input data is stored only in IC, but computation can happen in both IC and EC. When a MapReduce job is submitted to this Hadoop cluster, the nodes having free slots in EC pull the input data from remote data nodes in IC. This allows streaming reads within maps (overlapping input data transfer and processing within a map). The number of maps burst is regulated by the Hadoop scheduler, considering the map as well as the shuffle duration in EC. TaskSteal is implemented in a similar way, except that pending map tasks are launched in EC whenever slots become free (task stealing without any regulation). We configured Hadoop to run reducers only in IC, as the output data is required in IC. Thus, the reducers in IC shuffle intermediate data from both IC and EC.

3) Performance Results: TaskSteal and MaSa are based on the previous version of Hadoop, where there is a fixed partitioning of slots into map slots and reduce slots. Here, the reducers start right at the beginning of the computation (default reduce slowstart = 0.05). In YARN, this fixed partitioning is no longer maintained. We illustrate that, by setting reduce slowstart appropriately, YARN in an IC-only setting can outperform the above frameworks (running on both IC and EC) in many cases. Fig. 12 compares the speedup obtained using TaskSteal, MaSa and YARN relative to the previous version of Hadoop (PrevGen). The number of slots allocated to a job in IC was fixed at 40.

Fig. 13. Speedup Comparison of BStream with Related Works (modified for YARN) - speedup/slowdown (%) of TaskSteal, MaSa and BStream relative to Hadoop YARN; BStream performs 8−15% better than the other systems.

Except in the case of WC200, which requires very high computation in the map phase, YARN performs better than TaskSteal and MaSa. Due to the fixed partitioning of slots in the previous version of Hadoop, only 20 slots were used for mappers during the entire job lifecycle for TaskSteal and MaSa. For YARN, all 40 slots were used for mappers until the reduce slowstart. Thus, due to higher utilization of slots, YARN's performance is much better. This shows the importance of estimating reduce slowstart. As the above frameworks do not estimate reduce slowstart, we measured their performance at different values of reduce slowstart and used the minimum job completion time for comparison with BStream. Fig. 13 shows the speedup obtained using TaskSteal, MaSa and BStream relative to YARN, and Table V shows the corresponding cloud bursting parameters. TaskSteal bursts a larger number of maps to EC (indiscriminate task stealing). As the shuffle phase is coupled with the reducer, shuffling of intermediate output from EC begins only after the reduce slowstart condition is satisfied. Until then, intermediate output accumulates on EC nodes. For shuffle-heavy applications, downloading the accumulated output from EC delays the completion of reduce tasks in IC. Therefore, to handle the elongated shuffle, reduce slowstart must be set very low. But this decreases the utilization of slots in IC, thereby resulting in poor job completion time. The job completion time for MWC using TaskSteal is 20% worse than with YARN. As the compute to data ratio increases (e.g. WC, WC200), the performance of TaskSteal improves as it now utilizes both IC and EC resources effectively.

TABLE V
COMPARISON OF CLOUD BURSTING PARAMETERS IN TASKSTEAL, MASA AND BSTREAM

Job | Reduce slowstart (TaskSteal / MaSa / BStream) | Maps bursted (TaskSteal / MaSa / BStream) | Checkpointing in BStream: rdc (%) / dov (%) / enc (%)
WC | 0.75 / 0.75 / 0.96 | 131 / 131 / 91 | 0 / 0 / 0
MWC | 0.05 / 0.95 / 0.90 | 198 / 24 / 76 | 51 / 64 / 3
InvIndex | 0.25 / 0.95 / 0.95 | 172 / 24 / 72 | 7 / 13 / 5
Sort | 0.9 / 1.0 / 0.96 | 56 / 0 / 27 | 0 / 0 / 1
WC100 | 0.25 / 0.95 / 0.96 | 129 / 24 / 73 | 32 / 67 / 25
WC200 | 0.5 / 0.5 / 0.98 | 154 / 154 / 133 | 32 / 62 / 26

MaSa overcomes the issue of indiscriminate task stealing in TaskSteal by using the map-aware push and shuffle-aware map techniques. Thus, it performs better than TaskSteal. But the coupling of shuffle with reduce limits the number of maps that can be burst, thereby decreasing the utilization of resources in EC. Consequently, MaSa performs better than YARN only in WC (very low output), MWC (high compute, moderate data) and WC200 (very high compute).

In BStream, the downloading of results from EC (shuffle) is decoupled from the Hadoop reducers. Thus, unlike MaSa, both the number of maps burst and the reduce slowstart in BStream are much higher, achieving better utilization of both IC and EC resources. Further, the downloading of results is optimized in BStream: the download size is smaller because the reduce operation is performed in EC, and the download is overlapped with processing due to checkpointing. Checkpointing benefits jobs with a high output to input ratio by providing a high degree of overlap with low extra network cost (MWC, WC100 and WC200); for these jobs the degree of overlap is around 65% and the extra network cost is less than 26%. Thus, the speedup obtained by BStream is better than the other systems (by 8−15%) for all jobs except Sort and WC.

The Sort application has a low compute to data ratio for the maps, a high output size and heavy reduce computation. As the reduce operation is performed only in IC, Sort does not benefit from cloud bursting; hence, its performance is poor for all cloud bursting implementations. WC has a very low output to input ratio (≈0.06); hence, the speedup obtained by all the systems is almost the same.

In the next section, we analyze how different parameters affect the performance of BStream. This forms the basis for determining job characteristics suitable for bursting, thereby paving the way for developing algorithms that meet deadlines for multiple jobs.

VIII. OPERATIONAL REGIME

As the number of maps burst increases, the load in IC decreases. Consequently, more jobs can be accommodated in IC. By selecting a job which can burst a greater number of maps to EC, more slots are available in IC to meet the deadlines of other jobs. As the burst portion of the job is expected to complete before the end of the map/shuffle phase in IC, there is no additional waiting time due to bursting. Therefore, a job which can burst more maps to EC will complete faster, i.e., show a greater percentage reduction in completion time with respect to IC-only execution.

From Eq. (12), the number of burst maps can be derived as:

M − M_IC = M / (1 + (B · X_term · A) / (b_EC · (m_t + R · k_α)))    (14)

Fig. 14. Operational Regime for Map-critical Jobs - reduction in completion time (%) vs. per-map execution time mt (50-200s) for combinations of IC slots (20, 40, 60) and inter-cloud bandwidth (5MBps, 10MBps).

Thus, we see that the performance gain varies directly with the total number of maps and inversely with X_term · A / (b_EC · (m_t + R · k_α)). From Eq. (12), X_term stands for 1 + (1 − γ) · O_br, where O_br denotes the output to input ratio and γ is the degree of overlap between output transfer and processing. From our experiments, we note that the average degree of overlap is around 50%. Substituting this value into the expression for X_term, we observe that the performance gain is higher for jobs having a low output to input ratio.
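As a worked illustration of Eq. (14), the sketch below plugs hypothetical values into the formula. Only the formula itself is taken from the text; the interpretation of B as the per-map input size (the 64MB block) and the example parameter values, including treating R·k_α as negligible, are assumptions made for this sketch.

```java
// Worked illustration of Eq. (14). Parameter values are hypothetical; the reading of B
// as the per-map input size is an assumption for this sketch.
public class BurstEstimate {

    /** Maps offloaded to EC: M - M_IC = M / (1 + B*Xterm*A / (bEC*(mt + R*kAlpha))). */
    public static double burstMaps(double totalMaps, double blockMB, double outputToInputRatio,
                                   double overlapGamma, double slotsIC, double bandwidthMBps,
                                   double mapTimeSecs, double rTimesKAlpha) {
        double xTerm = 1.0 + (1.0 - overlapGamma) * outputToInputRatio;   // from Eq. (12)
        double denom = 1.0 + (blockMB * xTerm * slotsIC)
                             / (bandwidthMBps * (mapTimeSecs + rTimesKAlpha));
        return totalMaps / denom;
    }

    public static void main(String[] args) {
        // Example: 256 maps, 64MB blocks, output/input = 0.06, ~50% overlap, 20 IC slots,
        // 10 MBps inter-cloud bandwidth, mt = 50s, and R*kAlpha assumed negligible.
        System.out.printf("burst maps ~= %.1f%n",
                burstMaps(256, 64, 0.06, 0.5, 20, 10, 50, 0));
    }
}
```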

For map-critical jobs, the output to input ratio is low, so we can substitute X_term as 1 in Eq. (14). The performance gain (reduction in completion time) varies directly with the inter-cloud bandwidth (b_EC) and the per-map execution time (m_t). As b_EC increases, more maps can be burst within a given time. A higher value of m_t implies a higher compute to data ratio and hence a better bursting capacity. The performance results for map-critical jobs are shown in Fig. 14. For 20 slots in IC and m_t = 50s, the performance gain improves from 15% to 26% as b_EC increases from 5MBps to 10MBps. Keeping b_EC and the number of slots in IC constant, if m_t is increased to 200s, the performance gain increases further to 60%. Thus, we conclude that jobs having a large number of maps, a low output to input ratio and a high per-map computation time must be preferred for bursting. However, it is important to consider the available slack time while selecting a job for bursting. The execution of maps proceeds in waves, where the number of maps that complete in a wave equals the number of parallel slots (A) and the duration of a wave is m_t. Consider m_t = 50s, A = 60 and 256 maps; then the number of map waves is 5 (= ⌈256/60⌉) and the duration of a wave is 50s. If A = 20, the number of map waves is 13. If the job deadline (T) is such that the number of permissible map waves (≈ (T − r_t)/m_t, where r_t is the reduce time) is small, then the available slack time for bursting is also small. Therefore, only a few maps can be burst for such jobs, and consequently the performance gain due to bursting is not high. For example, the performance gain is only 5% for m_t = 50s and b_EC = 10MBps when A is 60. These jobs are not suitable for bursting and must be allocated more slots in IC to meet their deadline. On the other hand, jobs having more slack time show a greater performance gain and should be preferred for bursting. For example, the performance gain is 26% for m_t = 50s and b_EC = 10MBps when A is 20. In conclusion, the characteristics of jobs preferred for bursting are high slack time, low output to input ratio and high per-map computation time. Designing heuristics to meet deadlines for multiple jobs is part of future work.

Fig. 15. Extension to BStream - the reducer in IC runs in two stages, consuming checkpoints (Ch1 ... Chk) from the map/reduce pipeline in EC along with the internally generated map output.

IX. BSTREAM LIMITATIONS AND EXTENSIONS

While Storm (in EC) supports incremental reduce, Hadoop (in IC) assumes that all the values for a given key are available at once to the reducer. Therefore, the output from the burst portion of the job must be downloaded before the start of the reducer. This works fine for jobs where the computation in the map phase is significant (in general, MapReduce jobs have compute-intensive map operations and I/O-intensive reduce operations). But if the user-defined reduce function is highly compute-intensive, the available slack time for bursting decreases, thereby lowering the performance gain. We plan to extend our model as shown in Fig. 15, where the reducer in IC is executed in two partial steps. At the beginning of the first step, the Hadoop reducer consumes whatever checkpoint data is available; the rest of the output from EC is processed as part of the second step. This keeps the Storm topology in EC busy for a much longer duration, thereby improving performance.

In the current version of BStream, the user is required to write both Hadoop and Storm code for a job. By using Summingbird, which allows the same code to run on multiple platforms including Hadoop and Storm (https://github.com/twitter/summingbird), this requirement can be removed. The analytical model assumes there is no map/reduce skew and supports meeting the deadline of only a single job. Extending BStream with online profiling to handle skew and supporting multiple jobs in a multi-cloud setting are part of future work. We would also like to explore issues regarding the initial placement of data: with the input data present only in IC, the portion of the job that can be burst, and the consequent acceleration, are limited by the inter-cloud bandwidth.

X. CONCLUSION

In this paper, we presented BStream, which extends Hadoop to multiple clouds by combining stream processing in EC with batch processing in IC. BStream uses an analytical model to estimate resource allocation and task distribution across clouds to meet deadlines. We showed that the analytical model's predictions are reasonably accurate. We compared the performance of BStream with other existing works and showed that stream processing along with continuous checkpointing in EC can significantly improve performance. Finally, we characterized the operational regime of BStream, paving the way for meeting deadlines with multiple jobs.

ACKNOWLEDGMENT

The authors would like to thank Yahoo for donating servers to the Yahoo! Grid Lab at IIT Madras; the Hadoop cluster set up on them was used for the experiments.

REFERENCES

[1] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, "Apache hadoop goes realtime at facebook," in Proc. ACM Int'l Conf. on Manag. of Data (SIGMOD '11), 2011, pp. 1071-1080.

[2] C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. Rao, V. Sankarasubramanian, S. Seth, C. Tian, T. ZiCornell, and X. Wang, "Nova: continuous Pig/Hadoop workflows," in Proc. ACM Int'l Conf. on Manag. of Data (SIGMOD '11), 2011, pp. 1081-1090.

[3] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, "Mapreduce online," in Proc. 7th USENIX Conf. on Networked Syst. Design and Implementation (NSDI '10), 2010.

[4] H. Kim, S. Chaudhari, M. Parashar, and C. Marty, "Online risk analytics on the cloud," in Proc. IEEE/ACM 9th Int'l Symp. Cluster Comput. and the Grid, 2009, pp. 484-489.

[5] C. Vecchiola, R. N. Calheiros, D. Karunamoorthy, and R. Buyya, "Deadline-driven provisioning of resources for scientific applications in hybrid clouds with Aneka," Future Generation Comp. Syst., vol. 28, no. 1, pp. 58-65, 2012.

[6] S. Kailasam, N. Gnanasambandam, J. Dharanipragada, and N. Sharma, "Optimizing ordered throughput using autonomic cloud bursting schedulers," IEEE Transactions on Software Engineering, vol. 39, no. 11, pp. 1564-1581, 2013.

[7] S. Kim, J. Won, H. Han, H. Eom, and H. Y. Yeom, "Improving hadoop performance in intercloud environments," SIGMETRICS Perform. Eval. Rev., vol. 39, no. 3, pp. 107-109, Dec. 2011.

[8] B. Heintz, C. Wang, A. Chandra, and J. Weissman, "Cross-phase optimization in mapreduce," in Proc. IEEE Int'l Conf. on Cloud Eng. (IC2E '13), 2013.

[9] B. Heintz, A. Chandra, and R. K. Sitaraman, "Optimizing mapreduce for highly distributed environments," Univ. of Minnesota, Tech. Rep., Feb. 2012. [Online]. Available: http://www.cs.umn.edu/tech_reports_upload/tr2012/12-003.pdf

[10] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," in Proc. 6th USENIX Conf. on Operating Syst. Design and Implementation (OSDI '04), 2004.

[11] K. Kc and K. Anyanwu, "Scheduling hadoop jobs to meet deadlines," in Proc. 2nd IEEE Int'l Conf. on Cloud Comput. Technology and Science (CLOUDCOM '10), 2010, pp. 388-392.

[12] A. Verma, L. Cherkasova, and V. S. Kumar, "Deadline-based workload management for mapreduce environments: Pieces of the performance puzzle," in Proc. IEEE/IFIP Int'l Conf. on Network Operations and Manag. Symp. (NOMS '12), 2012.

[13] A. Verma, L. Cherkasova, and R. H. Campbell, "ARIA: Automatic Resource Inference and Allocation for Mapreduce environments," in Proc. 8th ACM Int'l Conf. on Autonomic Comput. (ICAC '11), 2011, pp. 235-244.

[14] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A self-tuning system for big data analytics," in Proc. 5th Int'l Conf. on Innovative Data Syst. Research (CIDR '11), 2011, pp. 261-272.

[15] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca, "Jockey: guaranteed job latency in data parallel clusters," in Proc. 7th ACM European Conf. on Comp. Syst. (EuroSys '12), 2012, pp. 99-112.

[16] Y. Luo, Z. Guo, Y. Sun, B. Plale, J. Qiu, and W. W. Li, "A hierarchical framework for cross-domain mapreduce execution," in Proc. 2nd ACM Int'l Wksp. on Emerging Computational Methods for the Life Sciences (ECMLS '11), 2011, pp. 15-22.


[17] M. Mattess, R. N. Calheiros, and R. Buyya, "Scaling mapreduce applications across hybrid clouds to meet soft deadlines," in Proc. 27th IEEE Int'l Conf. on Advanced Information Networking and Applications (AINA '13), 2013, pp. 629-636.

[18] T. Bicer, D. Chiu, and G. Agrawal, "A framework for data-intensive computing with cloud bursting," in Proc. IEEE Int'l Conf. on Cluster Comput., 2011, pp. 169-177.

[19] ——, "Time and cost sensitive data-intensive computing on hybrid clouds," in Proc. 12th IEEE/ACM Int'l Symp. on Cluster, Cloud and Grid Comput. (CCGrid), 2012, pp. 636-643.

[20] S. Hegde, "Autonomic cloudbursting for mapreduce framework using a deadline based scheduler," Master's thesis, Rutgers Univ., NJ, May 2011. [Online]. Available: http://nsfcac.rutgers.edu/people/parashar/Papers/samprita_thesis.pdf

[21] H. Kim and M. Parashar, "Cometcloud: An autonomic cloud engine," in Cloud Computing: Principles and Paradigms. Wiley, 2011, ch. 10, pp. 275-297.

[22] R. Kienzler, R. Bruggmann, A. Ranganathan, and N. Tatbul, "Stream As You Go: The Case for Incremental Data Access and Processing in the Cloud," in Proc. IEEE Int'l Wksp. on Data Manag. in the Cloud (DMC '12), 2012.

[23] D. Logothetis, C. Trezzo, K. C. Webb, and K. Yocum, "In-situ mapreduce for log processing," in Proc. USENIX Annu. Tech. Conf. (USENIX ATC '11), 2011.

[24] M. Cardosa, C. Wang, A. Nangia, A. Chandra, and J. Weissman, "Exploring mapreduce efficiency with highly-distributed data," in Proc. 2nd ACM Int'l Wksp. on MapReduce and its Applications (MapReduce '11), 2011, pp. 27-34.

[25] H. Gadre, I. Rodero, and M. Parashar, "Investigating mapreduce framework extensions for efficient processing of geographically scattered datasets," SIGMETRICS Perform. Eval. Rev., vol. 39, no. 3, pp. 116-118, Dec. 2011.

[26] H. Lin, X. Ma, J. Archuleta, W.-c. Feng, M. Gardner, and Z. Zhang, "Moon: Mapreduce on opportunistic environments," in Proc. 19th ACM Int'l Symp. on High Performance Distrib. Comput. (HPDC '10). ACM, 2010, pp. 95-106.

[27] F. Costa, L. Veiga, and P. Ferreira, "Internet-scale support for map-reduce processing," Journal of Internet Services and Applications, vol. 4, no. 1, pp. 1-17, 2013.

[28] F. Marozzo, D. Talia, and P. Trunfio, "P2P-MapReduce: Parallel data processing in dynamic cloud environments," J. Comput. Syst. Sci., vol. 78, no. 5, pp. 1382-1402, Sep. 2012.

Sriram Kailasam is a Ph.D. student of Computer Science and Engineering at the Indian Institute of Technology Madras. His research interests are autonomic scheduling in peer-to-peer and cloud-based systems, parallel programming models for the cloud, BigData analytics, and health grids.

Prateek Dhawalia is an MS student of Computer Science and Engineering at the Indian Institute of Technology Madras. His research interests include dynamic adaptation of MapReduce computations, BigData analytics, and operating systems.

S J Balaji is an MS student of Computer Science and Engineering at the Indian Institute of Technology Madras. He received a BE degree in Electronics and Telecommunication Engineering from Mumbai University in 2010. His research interests are operating systems design, distributed systems, cloud-based systems, energy-aware system design and manycore operating systems.

Geeta Iyer is an MS student of Computer Science and Engineering at the Indian Institute of Technology Madras. Her research work focused on exploring parallel programming models for cloud environments to support shared data structure abstractions and recursive computations. She is currently working as a Software Engineer at Ebay Inc.

Janakiram Dharanipragada is currently a professor in the Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, India, where he heads and coordinates the research activities of the Distributed and Object Systems Lab. He obtained his Ph.D degree from IIT Delhi. His current research focus is on building large-scale distributed systems, focusing on design-pattern-based techniques, measurements, peer-to-peer middleware based grid systems, cloud bursting, etc. He is currently an associate editor of IEEE Transactions on Cloud Computing, the SIG Chair of Distributed Computing of the Computer Society of India, Chair of the ACM Chennai Chapter, and the founder of the Forum for Promotion of Object Technology in India. He is the principal investigator for a number of projects, which include MobiTel: Mobile Telemedicine for rural India, Peer-to-peer Concept Search (Indo-Italian collaborative project), Service Oriented Architecture for the Linux Kernel (DIT) and Cloud Bursting Architecture for Document Workflow (Xerox).