Efficient On-Demand Operations in Large-Scale Infrastructures: Final Defense Presentation by Steven Y. Ko


An Intermediate Data Storage for Dataflow Programs

Efficient On-Demand Operations in Large-Scale Infrastructures
Final Defense Presentation by Steven Y. Ko

Thesis Focus
• One class of operations that faces one common set of challenges, cutting across four diverse and popular types of distributed infrastructures

Large-Scale Infrastructures
• They are everywhere:
  - Wide-area peer-to-peer (BitTorrent, LimeWire, etc.): for Web users
  - Grids (TeraGrid, Grid5000, ...): for scientific communities
  - Research testbeds (CCT, PlanetLab, Emulab, ...): for computer scientists
  - Clouds (Amazon, Google, HP, IBM, ...): for Web users and app developers
  - All connected over the Internet

On-Demand Operations
• Operations that act upon the most up-to-date state of the infrastructure
  - E.g., "what's going on right now in my data center?"
• Four operations in the thesis:
  - On-demand monitoring (prelim): clouds, research testbeds
  - On-demand replication (this talk): clouds, research testbeds
  - On-demand scheduling: Grids
  - On-demand key/value store: peer-to-peer
• Diverse infrastructures and essential operations, all facing one common set of challenges

Common Challenges: Scale and Dynamism
• On-demand monitoring
  - Scale: 200K machines (Google)
  - Dynamism: values can change any time (e.g., CPU-util)
• On-demand replication
  - Scale: ~300 TB total, adding 2 TB/day (Facebook HDFS)
  - Dynamism: resource availability (failures, bandwidth, etc.)
• On-demand scheduling
  - Scale: 4K CPUs running 44,000 tasks accessing 588,900 files (Coadd)
  - Dynamism: resource availability (CPU, mem, disk, etc.)
• On-demand key/value store
  - Scale: millions of peers (BitTorrent)
  - Dynamism: short-term unresponsiveness

Common Challenges: Scale and Dynamism
• [Figure: a scale vs. dynamism (static to dynamic) taxonomy, with on-demand operations in the large-scale, dynamic quadrant]

On-Demand Monitoring (Recap)
• Need: users and admins need to query and monitor the most up-to-date attributes (CPU-util, disk-cap, etc.) of an infrastructure
• Scale: # of machines, the amount of monitoring data, etc.
• Dynamism: static attributes (e.g., CPU type) vs. dynamic attributes (e.g., CPU-util)
• Moara
  - Expressive queries, e.g., avg. CPU-util where disk > 10 G or (mem-util < 50% and # of processes < 20)
  - Quick response time and bandwidth efficiency
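To make the query style concrete, here is a minimal, purely illustrative Java sketch that evaluates the example query above over a snapshot of per-machine attribute maps. The class name, method names, and attribute keys ("cpu-util", "disk-gb", ...) are hypothetical, and this snapshot-based evaluation is not Moara's actual query engine; it only shows the predicate and the aggregate in the example.

    import java.util.List;
    import java.util.Map;
    import java.util.OptionalDouble;

    // Illustration only: attribute names and the snapshot-based evaluation are
    // assumptions, not Moara's query language or implementation.
    public final class MonitoringQuerySketch {

      // WHERE clause of the example query:
      // disk > 10 G or (mem-util < 50% and # of processes < 20)
      static boolean matches(Map<String, Double> attrs) {
        return attrs.getOrDefault("disk-gb", 0.0) > 10.0
            || (attrs.getOrDefault("mem-util", 100.0) < 50.0
                && attrs.getOrDefault("num-processes", Double.MAX_VALUE) < 20.0);
      }

      // SELECT avg(cpu-util) over the machines that match the WHERE clause.
      static OptionalDouble avgCpuUtil(List<Map<String, Double>> machines) {
        return machines.stream()
            .filter(MonitoringQuerySketch::matches)
            .mapToDouble(m -> m.getOrDefault("cpu-util", 0.0))
            .average();
      }
    }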

Common Challenges: On-Demand Monitoring (DB vs. Moara)

                          Static attributes             Dynamic attributes
                          (e.g., CPU type)              (e.g., CPU util.)
  Small (data) scale      Centralized (e.g., DB)        Centralized (e.g., DB)
  Large (data) scale      Scaling centralized           My solution (Moara), etc.
                          (e.g., replicated DB)

Common Challenges: On-Demand Replication of Intermediate Data (in detail later)

                          Static (availability)         Dynamic (availability)
  Small (data) scale      Local file systems            Replication
                          (e.g., compilers)             (e.g., Hadoop w/ HDFS)
  Large (data) scale      Distributed file systems      My solution (ISS)
                          (e.g., Hadoop w/ HDFS)

Thesis Statement
• On-demand operations can be implemented efficiently in spite of scale and dynamism in a variety of distributed systems.
• Efficiency: responsiveness and bandwidth

Thesis Contributions
• Identifying on-demand operations as an important class of operations in large-scale infrastructures
• Identifying scale and dynamism as two common challenges
• Arguing the need for an on-demand operation in each case
• Showing how efficient each can be by simulations and real deployments

Thesis Overview
• ISS (Intermediate Storage System): implements on-demand replication (USENIX HotOS '09)
• Moara: implements on-demand monitoring (ACM/IFIP/USENIX Middleware '08)
• Worker-centric scheduling: implements on-demand scheduling (ACM/IFIP/USENIX Middleware '07)
• MPIL (Multi-Path Insertion and Lookup): implements on-demand key/value store (IEEE DSN '05)

Thesis Overview (infrastructure map)
• Wide-area peer-to-peer (BitTorrent, LimeWire, etc.): for Web users
• Grids (TeraGrid, Grid5000, ...): for scientific communities
• Research testbeds (CCT, PlanetLab, Emulab, ...): for computer scientists
• Clouds (Amazon, Google, HP, IBM, ...): for Web users and app developers
• Mapping onto the thesis: Moara & ISS for clouds and research testbeds; Worker-Centric Scheduling for Grids; MPIL for wide-area peer-to-peer

Related Work
• Management operations: MON [Lia05], vxargs, PSSH [Pla], etc.
• Scheduling [Ros04, Ros05, Vis04]
• Gossip-based multicast [Bir02, Gup02, Ker01]
• On-demand scaling: Amazon AWS, Google AppEngine, MS Azure, RightScale, etc.

ISS: Intermediate Storage System

Our Position
• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
  - Dataflow programming frameworks
  - The importance of intermediate data: scale and dynamism
  - ISS (Intermediate Storage System) with on-demand replication

Dataflow Programming Frameworks
• Runtime systems that execute dataflow programs: MapReduce (Hadoop), Pig, Hive, etc.
• Gaining popularity for massive-scale data processing
• Distributed and parallel execution on clusters
• A dataflow program consists of multi-stage computation and communication patterns between stages

Example 1: MapReduce
• Two-stage computation with all-to-all communication
• Introduced by Google, open-sourced by Yahoo! (Hadoop)
• Two functions, Map and Reduce, supplied by a programmer (a minimal sketch follows the figure below)
• Massively parallel execution of Map and Reduce

• [Figure: Stage 1: Map, then Shuffle (all-to-all), then Stage 2: Reduce]
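As a concrete illustration of the two programmer-supplied functions, here is a minimal word-count sketch written against the standard Hadoop MapReduce API. It is not taken from the talk; it is just the canonical shape of a Map and a Reduce function.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Canonical word count: Map emits (word, 1) pairs (the intermediate data),
    // the Shuffle groups them by word, and Reduce sums the counts per word.
    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          StringTokenizer tokens = new StringTokenizer(line.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate (key, value) pair
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) {
            sum += c.get();
          }
          context.write(word, new IntWritable(sum));  // final output
        }
      }
    }

Everything the Map side writes through context.write() is intermediate data: it exists only to be shuffled to the Reduce side, which is exactly the data ISS targets.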

Example 2: Pig and Hive
• Multi-stage computation with either all-to-all or 1-to-1 communication
• [Figure: Stage 1: Map, Shuffle (all-to-all), Stage 2: Reduce, 1-to-1 communication, Stage 3: Map, Stage 4: Reduce]

Usage
• Google (MapReduce)
  - Indexing: a chain of 24 MapReduce jobs
  - ~200K jobs processing 50 PB/month (in 2006)
• Yahoo! (Hadoop + Pig)
  - WebMap: a chain of 100 MapReduce jobs
• Facebook (Hadoop + Hive)
  - ~300 TB total, adding 2 TB/day (in 2008)
  - 3K jobs processing 55 TB/day
• Amazon
  - Elastic MapReduce service (pay-as-you-go)
• Academic clouds
  - Google-IBM Cluster at UW (Hadoop service)
  - CCT at UIUC (Hadoop & Pig service)

One Common Characteristic: Intermediate Data
• Intermediate data? The data between stages
• Similarities to traditional intermediate data [Bak91, Vog99], e.g., .o files:
  - Critical to produce the final output
  - Short-lived, written once and read once, used immediately
  - A computational barrier
• Difference: networked and failure-prone

One Common Characteristic: Computational Barrier

• [Figure: the intermediate data forms a computational barrier between Stage 1 (Map) and Stage 2 (Reduce)]

Why Important? Scale and Dynamism
• Large scale: a possibly very large amount of intermediate data
• Dynamism: if intermediate data is lost, the task can't proceed

• [Figure: Stage 1: Map, Stage 2: Reduce]

Failure Stats
• 5 average worker deaths per MapReduce job (Google in 2006)
• One disk failure in every run of a 6-hour MapReduce job with 4,000 machines (Google in 2008)
• 50 machine failures out of a 20K-machine cluster (Yahoo! in 2009)

Hadoop Failure Injection Experiment
• Emulab setting
  - 20 machines sorting 36 GB
  - 4 LANs and a core switch (all 100 Mbps)
• 1 failure injected after Map
• Re-execution of Map, Shuffle, and Reduce: ~33% increase in completion time
• [Figure: Map/Shuffle/Reduce task timelines with and without the injected failure]

Re-Generation for Multi-Stage
• Cascaded re-execution: expensive

• [Figure: cascaded re-execution across Stage 1: Map, Stage 2: Reduce, Stage 3: Map, Stage 4: Reduce]

Importance of Intermediate Data
• Why? Scale and dynamism: a lot of data, and very costly when lost
• Current systems handle it themselves: they re-generate intermediate data when it is lost, which can lead to expensive cascaded re-execution
• We believe that storage can provide a better solution than the dataflow programming frameworks
• Storage is the right abstraction, not the dataflow programming frameworks

Our Position (outline)
• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
  - Dataflow programming frameworks
  - The importance of intermediate data
  - ISS (Intermediate Storage System): why storage? challenges, solution hypotheses, hypotheses validation

Why Storage? On-Demand Replication
• Stops cascaded re-execution

• [Figure: with replicated intermediate data, the Stage 1 through Stage 4 dataflow no longer needs cascaded re-execution]

So, Are We Done? No!
• Challenge: minimal interference
  - The network is heavily utilized during Shuffle.
  - Replication requires network transmission too, and needs to replicate a large amount of data.
  - Minimizing interference is critical for the overall job completion time.
• Efficiency: completion time + bandwidth
• HDFS (Hadoop Distributed File System): much interference
• Realism of the choice: Google's GFS is used to store Reduce outputs, and Yahoo!'s Pig uses HDFS to store Reduce outputs

Default HDFS Interference
• On-demand replication of Map and Reduce outputs (2 copies in total)

Background Transport Protocols
• TCP Nice [Ven02] & TCP-LP [Kuz06]: support background and foreground flows
• Pros: background flows do not interfere with foreground flows (functionality)
• Cons
  - Designed for the wide-area Internet
  - Application-agnostic
  - Not designed for data-center replication
• We can do better!

Revisiting Common Challenges
• Scale: the need to replicate a large amount of intermediate data
• Dynamism: failures, foreground traffic (bandwidth availability)
• The two challenges together create much interference.

Revisiting Common Challenges: Intermediate Data Management

                          Static (availability)         Dynamic (availability)
  Small (data) scale      Local file systems            Replication
                          (e.g., compilers)             (e.g., Hadoop w/ HDFS)
  Large (data) scale      Distributed file systems      ISS
                          (e.g., Hadoop w/ HDFS)

Our Position (outline)
• Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
  - Dataflow programming frameworks
  - The importance of intermediate data
  - ISS (Intermediate Storage System): why is storage the right abstraction? challenges, solution hypotheses, hypotheses validation

Three Hypotheses
• Asynchronous replication can help. (HDFS replication works synchronously.)
• The replication process can exploit the inherent bandwidth heterogeneity of data centers (next).
• Data selection can help (later).

Bandwidth Heterogeneity
• Data center topology is hierarchical
  - Top-of-the-rack switches (under-utilized)
  - Shared core switch (fully utilized)
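The rack-level hypothesis follows directly from that topology: if a block's replica lands on another machine in the same rack, the copy crosses only the under-utilized top-of-rack switch and never touches the shared core switch. The sketch below is a minimal illustration of that placement rule only; the node-naming scheme ("rackId/hostId") and the method name are assumptions, not ISS's actual placement code.

    import java.util.List;
    import java.util.Random;
    import java.util.stream.Collectors;

    // Illustration of rack-local replica placement. Node names are assumed to be
    // "rackId/hostId" strings; ISS's real placement logic may differ.
    public final class RackLocalPlacement {
      private static final Random RNG = new Random();

      // Pick a replica target in the writer's own rack, excluding the writer itself,
      // so replication traffic stays under the top-of-rack switch.
      static String pickReplicaTarget(String writer, List<String> liveNodes) {
        String rack = writer.substring(0, writer.indexOf('/'));
        List<String> sameRack = liveNodes.stream()
            .filter(n -> n.startsWith(rack + "/") && !n.equals(writer))
            .collect(Collectors.toList());
        if (sameRack.isEmpty()) {
          return null;  // no rack-local candidate; caller may fall back to any node
        }
        return sameRack.get(RNG.nextInt(sameRack.size()));
      }
    }

The trade-off is fault tolerance versus interference: a rack-local copy does not survive a whole-rack failure, which the talk argues is acceptable because racks fail only 20~30 times per year, mostly planned (Google, 2008).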

Data Selection
• Only replicate locally-consumed data

• [Figure: across Stage 1: Map, Stage 2: Reduce, Stage 3: Map, Stage 4: Reduce, only the locally-consumed intermediate data is replicated]

Three Hypotheses
• Asynchronous replication can help.
• The replication process can exploit the inherent bandwidth heterogeneity of data centers.
• Data selection can help.

• The question is not if, but how much.
• If effective, these hypotheses become techniques.

Experimental Setting
• Emulab with 80 machines
  - 4 x 1 LAN with 20 machines each
  - 4 x 100 Mbps top-of-the-rack switches
  - 1 x 1 Gbps core switch
  - Various configurations give similar results.
• Input data: 2 GB/machine, randomly generated
• Workload: sort
• 5 runs; std. dev. ~100 sec, small compared to the overall completion time
• 2 replicas of Map outputs in total

Asynchronous Replication
• Modification for asynchronous replication
• Measured with an increasing level of interference; four levels:
  - Hadoop: original, no replication, no interference
  - Read: disk read, no network transfer, no actual replication
  - Read-Send: disk read & network send, no actual replication
  - Rep.: full replication
• Network utilization makes the difference: both Map & Shuffle get affected (some Maps need to read remotely, and Map and Shuffle overlap)
• Validation: asynchronous replication can help, but still can't eliminate the interference.

Rack-Level Replication
• Rack-level replication is effective.
• Only 20~30 rack failures per year, mostly planned (Google, 2008)
• Validation: rack-level replication can reduce the interference significantly.

Locally-Consumed Data Replication
• It significantly reduces the amount of replication.
• Validation: data selection can reduce the interference significantly.

ISS Design Overview
• Implements asynchronous, rack-level, selective replication (all three hypotheses)
• Replaces the Shuffle phase
  - MapReduce does not implement Shuffle.
  - Map tasks write intermediate data to ISS, and Reduce tasks read intermediate data from ISS.
• Extends HDFS with five calls: iss_create(), iss_open(), iss_write(), iss_read(), iss_close()
  - Map tasks: iss_create() => iss_write() => iss_close()
  - Reduce tasks: iss_open() => iss_read() => iss_close()
  - (a hedged Java rendering of these calls appears after the Summary below)

Hadoop + Failure
• 75% slowdown compared to no-failure Hadoop

Hadoop + ISS + Failure
• (Emulated) 10% slowdown compared to no-failure Hadoop
• Speculative execution can leverage ISS.

Replication Completion Time
• Replication completes before Reduce ("+" indicates the replication time for each block)

Summary
• Our position: intermediate data as a first-class citizen for dataflow programming frameworks in clouds
• Problem: cascaded re-execution
• Requirements
  - Intermediate data availability (scale and dynamism)
  - Interference minimization (efficiency)
• Asynchronous replication can help, but still can't eliminate the interference.
• Rack-level replication can reduce the interference significantly.
• Data selection can reduce the interference significantly.
• Hadoop and Hadoop + ISS show comparable completion times.
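As promised above, a hedged Java rendering of the ISS interface. Only the five call names and the Map-side and Reduce-side call orders come from the slides; the parameter and return types below are assumptions made for illustration.

    import java.io.IOException;
    import java.nio.ByteBuffer;

    // Hypothetical signatures around the five calls named on the slide.
    public interface IntermediateStorage {

      // Map-side lifecycle: iss_create() => iss_write() => iss_close()
      long iss_create(String blockName) throws IOException;            // returns a block handle
      void iss_write(long handle, ByteBuffer data) throws IOException; // replicated asynchronously

      // Reduce-side lifecycle: iss_open() => iss_read() => iss_close()
      long iss_open(String blockName) throws IOException;
      int iss_read(long handle, ByteBuffer dst) throws IOException;    // bytes read, or -1 at end

      // Shared by both sides
      void iss_close(long handle) throws IOException;
    }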

Thesis Overview (infrastructure map)
• Wide-area peer-to-peer (BitTorrent, LimeWire, etc.): for Web users
• Grids (TeraGrid, Grid5000, ...): for scientific communities
• Research testbeds (CCT, PlanetLab, Emulab, ...): for computer scientists
• Clouds (Amazon, Google, HP, IBM, ...): for Web users and app developers
• Mapping onto the thesis: Moara & ISS for clouds and research testbeds; Worker-Centric Scheduling for Grids; MPIL for wide-area peer-to-peer

On-Demand Scheduling
• One-line summary: on-demand scheduling of Grid tasks for data-intensive applications
• Scale: # of tasks, # of CPUs, the amount of data
• Dynamism: resource availability (CPU, mem, disk, etc.)
• Worker-centric scheduling: a worker's availability is the first-class criterion in scheduling tasks

On-Demand Scheduling: Taxonomy

                          Static                        Dynamic
  Small scale             Static scheduling             Task-centric scheduling
  Large scale             Task-centric scheduling       Worker-centric scheduling

On-Demand Key/Value Store
• One-line summary: on-demand key/value lookup algorithm under dynamic environments
• Scale: # of peers
• Dynamism: short-term unresponsiveness
• MPIL: a DHT-style lookup algorithm that can operate over any topology and is resistant to perturbation

On-Demand Key/Value Store: Taxonomy

                          Static                        Dynamic
  Small scale             Napster-style                 Napster-style
                          central directory             central directory
  Large scale             DHTs                          MPIL

Conclusion
• On-demand operations can be implemented efficiently in spite of scale and dynamism in a variety of distributed systems.
• Each on-demand operation has a different need, but a common set of challenges.

References
[Lia05] J. Liang, S. Y. Ko, I. Gupta, and K. Nahrstedt. MON: On-demand Overlays for Distributed System Management. In USENIX WORLDS, 2005.
[Pla] PlanetLab. http://www.planet-lab.org/
[Ros05] A. L. Rosenberg and M. Yurkewych. Guidelines for Scheduling Some Common Computation-Dags for Internet-Based Computing. IEEE TC, 54(4), April 2005.
[Ros04] A. L. Rosenberg. On Scheduling Mesh-Structured Computations for Internet-Based Computing. IEEE TC, 53(9), September 2004.
[Vis04] S. Viswanathan, B. Veeravalli, D. Yu, and T. G. Robertazzi. Design and Analysis of a Dynamic Scheduling Strategy with Resource Estimation for Large-Scale Grid Systems. In IEEE/ACM GRID, 2004.
[Bir02] K. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal Multicast. ACM Transactions on Computer Systems, 17(2):41-88.
[Gup02] I. Gupta, A.-M. Kermarrec, and A. J. Ganesh. Efficient Epidemic-Style Protocols for Reliable and Scalable Multicast. In SRDS, 2002.
[Ker01] A.-M. Kermarrec, L. Massoulié, and A. J. Ganesh. Probabilistic Reliable Dissemination in Large-Scale Systems. IEEE TPDS, 14:248-258.
[Vog99] W. Vogels. File System Usage in Windows NT 4.0. In SOSP, 1999.
[Bak91] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a Distributed File System. SIGOPS OSR, 25(5), 1991.
[Ven02] A. Venkataramani, R. Kokku, and M. Dahlin. TCP Nice: A Mechanism for Background Transfers. In OSDI, 2002.
[Kuz06] A. Kuzmanovic and E. W. Knightly. TCP-LP: Low-Priority Service via End-Point Congestion Control. IEEE/ACM TON, 14(4):739-752, 2006.

Backup Slides

Hadoop Experiment
• Emulab setting
  - 20 machines sorting 36 GB
  - 4 LANs and a core switch (all 100 Mbps)
• Normal execution: Map, Shuffle, Reduce
• [Figure: Map/Shuffle/Reduce task timeline of a normal run]

Worker-Centric Scheduling
• One-line summary: on-demand scheduling of Grid tasks for data-intensive applications
• Background
  - Data-intensive Grid applications are divided into chunks.
  - Data-intensive applications access a large set of files, so file transfer time is a bottleneck (from the Coadd experience).
  - Different tasks access many files together (locality).
• Problem: how to schedule the tasks of a data-intensive application that access a large set of files?

Worker-Centric Scheduling
• Solution: on-demand scheduling based on workers' availability and the reuse of files
• Task-centric vs. worker-centric: consideration of workers' availability for execution
• We have proposed a series of worker-centric scheduling heuristics that minimize file transfers (see the sketch below)
• Results
  - Tested our heuristics with a real trace
  - Compared them with the state-of-the-art task-centric scheduling
  - Showed task completion time reductions of up to 20%
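To make the worker-centric idea concrete, here is a minimal sketch under an assumed representation of tasks and of the files already present at a worker: when a worker becomes available, pick the pending task that needs the fewest new file transfers. This illustrates the general idea only; it is not one of the thesis's actual heuristics.

    import java.util.List;
    import java.util.Set;

    // Illustrative worker-centric selection: the worker's availability drives the
    // decision, and the chosen task is the one reusing the most already-present files.
    public final class WorkerCentricSketch {

      static final class Task {
        final String id;
        final Set<String> inputFiles;
        Task(String id, Set<String> inputFiles) {
          this.id = id;
          this.inputFiles = inputFiles;
        }
      }

      // Called when a worker becomes available; returns the pending task needing the
      // fewest file transfers to that worker (null if nothing is pending).
      static Task pickTask(Set<String> filesAtWorker, List<Task> pendingTasks) {
        Task best = null;
        long bestMissing = Long.MAX_VALUE;
        for (Task t : pendingTasks) {
          long missing = t.inputFiles.stream()
              .filter(f -> !filesAtWorker.contains(f))
              .count();
          if (missing < bestMissing) {
            bestMissing = missing;
            best = t;
          }
        }
        return best;
      }
    }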

MPIL (Multi-Path Insertion/Lookup)
• One-line summary: on-demand key/value lookup algorithm under dynamic environments
• Background
  - Various DHTs (Distributed Hash Tables) have been proposed.
  - Real-world trace studies show that churn is a threat to DHT performance.
  - DHTs employ aggressive maintenance algorithms to combat churn: low lookup cost, but high maintenance cost.

MPIL (Multi-Path Insertion/Lookup)
• Alternative: MPIL
  - A new DHT routing algorithm
  - Low maintenance cost, but slightly higher lookup cost
• Features
  - Uses multi-path routing to combat dynamism (perturbation in the environment); the general idea is sketched below
  - Runs over any overlay, so no maintenance algorithm is needed
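A minimal sketch of the multi-path idea in general, not of MPIL's actual rules: from each node, forward the lookup to a few neighbors whose IDs are numerically closest to the key, so several paths race toward the target over whatever overlay links happen to exist. The graph representation, the distance metric, and the fan-out below are all assumptions for illustration.

    import java.util.ArrayDeque;
    import java.util.Comparator;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustration of multi-path lookup over an arbitrary overlay (adjacency lists).
    // A node is assumed to own the key equal to its own ID; MPIL's real rules differ.
    public final class MultiPathLookupSketch {

      static boolean lookup(Map<Long, List<Long>> overlay, long start, long key,
                            int fanout, int maxHops) {
        Set<Long> visited = new HashSet<>();
        Deque<long[]> frontier = new ArrayDeque<>();  // each entry: {nodeId, hopsUsed}
        frontier.add(new long[] {start, 0});

        while (!frontier.isEmpty()) {
          long[] current = frontier.poll();
          long node = current[0];
          if (node == key) {
            return true;                              // reached the key's owner
          }
          if (current[1] >= maxHops || !visited.add(node)) {
            continue;                                 // hop budget spent, or already visited
          }
          // Forward along several paths at once: the 'fanout' neighbors closest to the key.
          overlay.getOrDefault(node, List.of()).stream()
              .sorted(Comparator.comparingLong(n -> Math.abs(n - key)))
              .limit(fanout)
              .forEach(n -> frontier.add(new long[] {n, current[1] + 1}));
        }
        return false;                                 // all paths exhausted
      }
    }

Because only local neighbor lists are consulted, the same procedure works over any overlay; the redundancy of the multiple paths is what tolerates short-term unresponsiveness.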

[Chart data omitted. The remaining pages contained only the plotted data behind the earlier slides: Map/Shuffle/Reduce task-count timelines (# of tasks vs. time in seconds), completion times (sec) for Normal/R/MR runs with replication, completion times (sec) vs. interference level (Hadoop, Read, Read-Send, Rep.), completion times (sec) vs. replication type (Hadoop, HDFS, Rack-Rep. and Hadoop, HDFS, Local-Rep.), and per-block replication durations (sec) over time.]