Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice

Workflow Allocations and Scheduling on IaaSPlatforms, from Theory to Practice

Eddy Caron1, Frederic Desprez2, Adrian Mures, an1, FredericSuter3, Kate Keahey4

1Ecole Normale Superieure de Lyon, France

2 INRIA

3IN2P3 Computing Center, CNRS, IN2P3

4UChicago, Argonne National Laboratory

Joint-lab workshop

Outline

Context

TheoryModelsProposed solutionSimulations

PracticeArchitectureApplicationExperimentation

Conclusions and perspectives

Workflows are a common pattern in scientific applications

I applications built on legacy code

I applications built as an aggregate

I use inherent task parallelism

I phenomenons having inherent workflow structure

Workflows are omnipresent!

(a) Ramses (b) Montage

(c) Ocean current modeling (d) Epigenomics

Figure Workflow application examples

Classic model of resource provisioning

I static allocations in a grid environment

I researchers compete for resources

I researchers tend to over-provision and under-use

I workflow applications have a non-constant resource demand

This is inefficient, but can it be improved?

Yes!How?

I on-demand resources

I automate resource provisioning

I smarter scheduling strategies






This is inefficient, but can it be improved?Yes!How?









This is inefficient, but can it be improved?Yes!How?




Why on-demand resources?

I more efficient resource usage

I eliminate overbooking of resources

I can be easily automated

I unlimited resources ∗

=⇒ budget limit

Our goal

I consider a more general model of workflow apps

I consider on-demand resources and a budget limit

I find a good allocation strategy





I unlimited resources ∗ =⇒ budget limit

Our goal








I unlimited resources ∗ =⇒ budget limit

Our goal




Related work

Functional workflowsBahsi, E.M., Ceyhan, E., Kosar, T.: Conditional Workflow Management: A Survey and Analysis. Scientific

Programming 15(4), 283–297 (2007)

biCPADesprez, F., Suter, F.: A Bi-Criteria Algorithm for Scheduling Parallel Task Graphs on Clusters. In: Proc.

of the 10th IEEE/ACM Intl. Symposium on Cluster, Cloud and Grid Computing. pp. 243–252 (2010)

Chemical programming for workflow applicationsFernandez, H., Tedeschi, C., Priol, T.: A Chemistry Inspired Workflow Management System for Scientific

Applications in Clouds. In: Proc. of the 7th Intl. Conference on E-Science. pp. 39–46 (2011)

PegasusTMalawski, M., Juve, G., Deelman, E. and Nabrzyski, J..: Cost- and Deadline-Constrained Provisioning

for Scientific Workflow Ensembles in IaaS Clouds. 24th IEEE/ACM International Conference onSupercomputing (SC12) (2012)

Outline

Context




1

3

4

2

(a) Classic

1

3

4

2

(b) PTG

1

3

4

2

Loop

Or

(c) Functional

Figure Workflow types

Application model

Non-deterministic (functional) workflows

An application is a graph G = (V, E), where

V = {vi |i = 1, . . . , |V |} is a set of vertices

E = {ei ,j |(i , j) ∈ {1, . . . , |V |} × {1, . . . , |V |}} is a set of edgesrepresenting precedence and flow constraints

Vertices

I a computational task [parallel, moldable]

I an OR-split [transitions described by random variables]

I an OR-join

Example workflow

Figure Example workflow

Platform modelA provider of on-demand resources from a catalog:C = {vmi = (nCPUi , costi )|i ≥ 1}

nCPU represents the number of equivalent virtual CPUs

cost represents a monetary cost per running hour(Amazon-like)

communication bounded multi-port model

Makespan

C = maxi C (vi ) is the global makespan where

C (vi ) is the finish time of task vi ∈ V

Cost of a schedule SCost =

∑∀vmi∈SdTendi

− Tstarti e × costi

Tstarti ,Tendirepresent the start and end times of vmi

costi is the catalog cost of virtual resource vmi

Problem statementGiven

G a workflow application

C a provider of resources from the catalog

B a budget

find a schedule S such that

Cost ≤ B budget limit is not passed

C (makespan) is minimized

with a predefined confidence.

Proposed approach

1. Decompose the non-DAG workflow into DAG sub-workflows

2. Distribute the budget to the sub-workflows

3. Determine allocations by adapting an existing allocationapproach

Step 1: Decomposing the workflow

Figure Decomposing a nontrivial workflow

Step 2: Allocating budget

1. Compute the number of executions of each sub-worflow

I # of transitions of the edge connecting its parent OR-split toits start node

I Described by a random variable according to a distinct normaldistribution + confidence parameter

2. Give each sub-workflow a ratio of the budget proportional to itswork contribution.

Work contribution of a sub-workflow G i

I as the sum of the average execution times of its tasks

I average execution time computed over the catalog CI task speedup model is taken into consideration

I multiple executions of a sub-workflow also considered

Step 3: Determining allocations

Two algorithms based on the bi-CPA algorithm.

Eager algorithm

I one allocation for each task

I good trade-off between makespan and average time-cost area

I fast algorithm

I considers allocation-time cost estimations only

Deferred algorithm

I outputs multiple allocations for each task


I slower algorithm

I one allocation is chosen at scheduling time

Step 3: Determining allocations

Two algorithms based on the bi-CPA algorithm.

Eager algorithm

I one allocation for each task


I fast algorithm

I considers allocation-time cost estimations only

Deferred algorithm

I outputs multiple allocations for each task


I slower algorithm

I one allocation is chosen at scheduling time

Algorithm parameters

T overA , T under

A average work allocated to tasks

TCP duration of the critical path

B ′ estimation of the used budget when TA and TCP

meet

I TA keeps increasing as we increase theallocation of tasks and TCP keeps decreasing sothey will eventually meet.

I When they do meet we have a trade-off betweenthe average work in tasks and the makespan.

p(vi ) number of processing units allocated to task vi

The eager allocation algorithm

1: for all v ∈ V i do2: Alloc(v)← {minvmi∈C CPUi}3: end for4: Compute B ′

5: while TCP > T overA ∩

∑|V i |j=1 cost(vj ) ≤ B i do

6: for all vi ∈ Critical Path do7: Determine Alloc ′(vi ) such that p′(vi ) = p(vi ) + 1

8: Gain(vi )← T (vi ,Alloc(vi ))p(vi )

− T (vi ,Alloc ′(vi ))p′(vi )

9: end for10: Select v such that Gain(v) is maximal11: Alloc(v)← Alloc ′(v)12: Update T over

A and TCP

13: end whileAlgorithm 1: Eager-allocate(G i = (V i , E i ),B i )

Methodology

I Simulation using SimGridI Used 864 synthetic workflows for three types of applications

I Fast Fourier TransformI Strassen matrix multiplicationI Random workloads

I Used a virtual resource catalog inspired by Amazon EC2

I Used a classic list-scheduler for task mappingI Measured

I Cost and makespan after task mapping

Name #VCPUs Network performance Cost / hourm1.small 1 moderate 0.09m1.med 2 moderate 0.18m1.large 4 high 0.36m1.xlarge 8 high 0.72m2.xlarge 6.5 moderate 0.506

m2.2xlarge 13 high 1.012m2.4xlarge 26 high 2.024

c1.med 5 moderate 0.186c1.xlarge 20 high 0.744

cc1.4xlarge 33.5 10 Gigabit Ethernet 0.186cc2.8xlarge 88 10 Gigabit Ethernet 0.744

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1 5 10 15 20 25 30 35 40 45 50

Re

lative

ma

ke

sp

an

Budget limit

Equality

Figure Relative makespan ( EagerDeferred ) for all workflow applications

0

1

2

3

4

5

1 5 10 15 20 25 30 35 40 45

Re

lative

co

st

Budget limit

Equality

Figure Relative cost ( EagerDeferred ) for all workflow applications

First conclusions

I Eager is fast but cannot guarantee budget constraint aftermapping

I Deferred is slower, but guarantees budget constraint

I After a certain budget they yield to identical allocations

I for small applications and small budgets Deferred should bepreferred.

I When the size of the applications increases or the budget limitapproaches task parallelism saturation, using Eager ispreferable.

Outline

Context




Architecture

DIET (MADAG) Workflow engine

Nimbus

VM1

VM2

VM3

VMn

Acquire/release resources

Send workflow

Receive result

Deploy workflow tasks

Scientist

SeD Ramses

Phantom frontend

SeD Grafic1

Phantom frontend

Phantom

service

. . .

Manage Autoscaling Groups

Figure System architecture

Main components

Nimbus

I open-source IaaS provider

I provides low-level resources (VMs)

I compatible with the Amazon EC2 interface

I used a FutureGrid install

Main components

Phantom

I auto-scaling and high availability provider

I high-level resource provider

I subset of the Amazon auto-scale service

I part of the Nimbus platform

I used a FutureGrid install

I still under development

Main components

MADag

I workflow engine

I part of the DIET (Distributed Interactive Engineering Toolkit)software

I one service implementation per task

I each service launches its afferent task

I supports DAG, PTG and functional workflows

Main components

Client

I describes his workflow in xml

I implements the services

I calls the workflow engine

I no explicit resource management

I selects the IaaS provider to deploy on

How does it work?

DIET (MADAG) Workflow engine

Nimbus

VM1

VM2

VM3

VMn

Acquire/release resources

Send workflow

Receive result

Deploy workflow tasks

Scientist

SeD Ramses

Phantom frontend

SeD Grafic1

Phantom frontend

Phantom

service

. . .

Manage Autoscaling Groups

Figure System architecture

RAMSES

I n-body simulations of dark matter interactions

I backbone of galaxy formations

I AMR workflow application

I parallel (MPI) application

I can refine at different zoom levels

RAMSES

Figure The RAMSES workflow

Methodology

I used a FutureGrid Nimbus installation as a testbed

I measured running time for static and dynamic allocations

I estimated cost for each allocation

I varied maximum number of used resources

Figure Slice through a 28 × 28 × 28 box simulation

Results

5:00

10:00

15:00

20:00

25:00

0 2 4 6 8 10 12 14 16

Ru

nn

ing

tim

e (

MM

:SS

)

Number of VMs

Runtime (static)VM alloc (dynamic)Runtime (dynamic)

Figure Running times for a 26 × 26 × 26 box simulation

Results

0

5

10

15

20

25

0 2 4 6 8 10 12 14 16

Co

st

(un

its)

Number of VMs

Cost (static)Cost (dynamic)

Figure Estimated costs for a 26 × 26 × 26 box simulation

Outline

Context




Conclusions

I proposed two algorithms – Eager and Deferred with each theirpro and cons

I on-demand resources can better model workflow usage

I on-demand resources have a VM allocation overhead

I allocation overhead decreases with number of VMs

I for RAMSES, cost is greatly reduced

Perspectives

I preallocate VMs

I spot instances

I smarter scheduling strategy

I determine per application type which is the tipping point

I Compare our algorithms with others

Collaborations

I Continue the collaboration with the Nimbus/FutureGrid teams

I On the algorithms themselves (currently too complicated foran actual implementation)

I Understanding (obtaining models) clouds and virtualizedplatforms

I going from theoretical algorithms to (accurate) simulationsand actual implementation

Perspectives

I preallocate VMs

I spot instances

I smarter scheduling strategy

I determine per application type which is the tipping point

I Compare our algorithms with others

Collaborations

I Continue the collaboration with the Nimbus/FutureGrid teams

I On the algorithms themselves (currently too complicated foran actual implementation)

I Understanding (obtaining models) clouds and virtualizedplatforms

I going from theoretical algorithms to (accurate) simulationsand actual implementation

References

Eddy Caron, Frederic Desprez, Adrian Muresan and Frederic Suter.Budget Constrained Resource Allocation for Non-DeterministicWorkflows on a IaaS Cloud. 12th International Conference onAlgorithms and Architectures for Parallel Processing. (ICA3PP-12),Fukuoka, Japan, September 04 - 07, 2012

Adrian Muresan, Kate Keahey. Outsourcing computations for galaxysimulations. In eXtreme Science and Engineering Discovery Environment2012 - XSEDE12, Chicago, Illinois, USA, June 15 - 19 2012. Postersession.

Adrian Muresan. Scheduling and deployment of large-scale applicationson Cloud platforms. Laboratoire de l’Informatique du Parallelisme (LIP),ENS Lyon, France, Dec. 10th, 2012 PhD thesis.

Algorithm parameters

T overA - Eager

T overA =

1

B ′×|V i |∑j=1

(T (vj ,Alloc(vj ))× cost(vj )) ,

T underA - Deferred

T underA =

1

B ′×|V i |∑j=1

(T (vj ,Alloc(vj ))× costunder (vj ))

Budget distribution algorithm

1: ω∗ ← 02: for all G i = (V i , E i ) ⊆ G do3: nExec i ← CDF−1(D i ,Confidence)

4: ωi ←∑

vj∈V i

1

|C|∑

vmk∈CT (vj , vmk )

× nExec i

5: ω∗ ← ω∗ + ωi

6: end for7: for all G i ⊆ G do8: B i ← B × ωi

ω∗ ×1

nExec i

9: end forAlgorithm 2: Share Budget(B,G,Confidence)

Absolute makespans

00:00:00

00:10:00

00:20:00

00:30:00

00:40:00

00:50:00

01:00:00

01:10:00

01:20:00

1 5 10 15 20 25 30 35 40 45

Ma

ke

sp

an

(H

H:M

M:S

S)

Budget limit

(a) Eager for all workflow applications

Absolute makespans

00:00:00

00:10:00

00:20:00

00:30:00

00:40:00

00:50:00

01:00:00

01:10:00

01:20:00

1 5 10 15 20 25 30 35 40 45

Ma

ke

sp

an

(H

H:M

M:S

S)

Budget limit

(b) Deferred for all workflow applications

Absolute cost

0

10

20

30

40

50

1 5 10 15 20 25 30 35 40 45 50

Co

st

Budget limit

Budget limit

(c) Eager for all workflow applications

Absolute cost

0

10

20

30

40

50

1 5 10 15 20 25 30 35 40 45 50

Co

st

Budget limit

Budget limit

(d) Deferred for all workflow applications

Documents

Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice