A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids


Rafael FERREIRA DA SILVA University of Lyon, CNRS, INSERM, CREATIS

Villeurbanne, France

Supervisors: Frédéric DESPREZ and Tristan GLATARD

This work was funded by the French National Agency for Research under grant ANR-09-COSI-03 "VIP"

Outline

• Technical context and challenges
• Contributions
  - Self-healing of workflow executions on grids
    - Treatment of blocked activities
    - Optimization of task granularity
    - Fairness control among workflow executions
• Conclusions


Heavy Medical Simulations

Treatment planning for prostate protontherapy [L. Grevillot, D. Sarrut] CPU time: 2 months

Simulated diffusion-weighted images [L. Wang, Y. Zhu, I. Magnin] CPU time: 8 years

Echography simulation [O. Bernard, M. Alessandrini] CPU time: 42 hours

Virtual Imaging Platform

Public Computing Infrastructure: 150 computing sites world-wide

Medical-Imaging Execution Platform: 491 users from 52 countries

Goal: Self-healing of workflow executions on grids to handle operational issues

Workflow Execution

1. Input data upload
2. User launches a simulation (application workflow)
3. Workflow engine generates invocations
4. Invocations are wrapped into grid jobs
5. Jobs are submitted to a Pilot Engine
6. Pilot jobs are submitted to the distributed infrastructure
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results

Science-Gateway

High-level interface
Software-as-a-Service
Virtual Imaging Platform (VIP)


Workflow Management System: applications described as workflows, parallel language, grid-aware enactor


Workload Management System: pilot jobs run special agents that fetch user tasks from the task queue, set up their environment and steer their execution
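The pilot-job pattern described above can be illustrated with a minimal sketch. It assumes an in-memory queue and a hypothetical task format (a dict with `id` and `run` keys); it is not the pilot engine used by VIP.

```python
import queue


def run_pilot(task_queue):
    """Hypothetical pilot agent: pull user tasks until the queue is empty."""
    results = []
    while True:
        try:
            task = task_queue.get_nowait()  # fetch a user task from the task queue
        except queue.Empty:
            break                           # no work left: the pilot terminates
        env = {"task_id": task["id"]}       # set up the task environment
        output = task["run"](env)           # steer (here: simply run) the execution
        results.append((task["id"], output))
    return results


tasks = queue.Queue()
for i in range(3):
    tasks.put({"id": i, "run": lambda env: env["task_id"] ** 2})
print(run_pilot(tasks))  # [(0, 0), (1, 1), (2, 4)]
```

Because the pilot, not the grid scheduler, pulls the work, a user task only starts once a resource is actually available.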


European Grid Infrastructure (EGI): +100 computing sites, +25,000 job slots, ~4 PB of storage

Challenges

• Several workflow execution errors
• Several dysfunctional and performance problems
  - Requires manual interventions
• Problem: costly manual operations
  - e.g., rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files

[Figure: number of launched and completed workflows in VIP from January to December 2012]

Average workflow completion rate is about 60%

Objectives

• Objective: automated platform administration
  - Autonomous detection of operational incidents
  - Perform appropriate set of actions
• Assumptions: online and non-clairvoyant
  - Decisions must be fast
  - No information about tasks (duration, data transfer time, etc.)
  - No information about resources (availability, performance, etc.)
  - No prediction of user activity or workloads


State of the Art

• Self-healing of workflow executions
  - Most works from the literature are offline and/or clairvoyant
• Common techniques to address operational incidents
  - Task resubmission: [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010]
  - Task and file replication: [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013]
  - Task grouping: [Muthuvelu et al., 2005-2013], [Liu and Liao, 2009], [Chen et al., 2013]
  - Heuristics to fairly schedule workflow tasks: [Zhao and Sakellariou, 2006], [N'Takpe and Suter, 2009], [Casanova et al., 2010]

Fuzzy Finite State Machine

• The healing process sets the degree of FuSM states from incident detection metrics
  - Crisp states: possible values are 0 or 1
  - Fuzzy states: values between 0 and 1

General MAPE-K loop

• Loop: Monitoring → Analysis → Planning → Execution, around shared Knowledge
• Monitoring is triggered by an event (job completion and failures) or a timeout, and collects monitoring data
• Analysis: each incident has a degree η (e.g., incident 1 has degree η = 0.8), quantified into levels (level 1, level 2, level 3)
• Roulette wheel selection: incident i among n incidents is selected with probability

$$p(I_i) = \frac{\eta_i}{\sum_{j=1}^{n} \eta_j}$$

• Planning: roulette wheel selection based on association rules; the selected incident (incident 2 in the example) determines the set of actions

Association rules for incident 1:

Rule 2 → 1: confidence ρ = 0.8, ρ × η = 0.32
Rule 3 → 1: confidence ρ = 0.2, ρ × η = 0.02
Rule 1 → 1: confidence ρ = 1.0, ρ × η = 0.80

[Figure: histogram of the degree ηu (0.0 to 1.0) vs frequency]

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
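Roulette wheel selection, as used in the MAPE-K loop, picks an item with probability proportional to its weight. The sketch below applies it first to incident degrees and then to rule weights ρ × η. The degrees 0.4 and 0.1 for incidents 2 and 3 are inferred from the rule table (0.32 / 0.8 and 0.02 / 0.2) on the assumption that each rule is weighted by the degree of its source incident; the function names are illustrative.

```python
import random


def selection_probabilities(weights):
    """Normalize weights so that p_i = w_i / sum_j w_j."""
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: w / total for name, w in weights.items()}


def roulette_select(weights, rng=random):
    """Draw one key with probability proportional to its weight (None if all are zero)."""
    names = list(weights)
    values = [weights[name] for name in names]
    if sum(values) == 0:
        return None
    return rng.choices(names, weights=values, k=1)[0]


# Incident degrees: 0.8 is from the slide, 0.4 and 0.1 are inferred (see lead-in).
degrees = {"incident 1": 0.8, "incident 2": 0.4, "incident 3": 0.1}
print(selection_probabilities(degrees))
print(roulette_select(degrees))

# Second draw over the association rules of incident 1, weighted by rho * eta.
rules = {"1 -> 1": 0.80, "2 -> 1": 0.32, "3 -> 1": 0.02}
print(roulette_select(rules))
```

A randomized choice lets incidents with a lower degree still be selected occasionally, instead of always treating the most severe one.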

Incident Levels and Actions

• Incident degrees are quantified in discrete incident levels
• Thresholds are determined from mode clustering
  - Thresholds τ cluster platform configurations into groups
  - Below the threshold: no actions are triggered
  - Above the threshold: triggers a set of actions

A-priori knowledge

• Based on the workload of VIP
  - January 2011 to April 2012
• 112 users
• 2,941 workflow executions
• 339,545 pilot jobs
• 680,988 tasks
  - 338,989 completed
  - 138,480 error
  - 105,488 aborted
  - 15,576 aborted replicas
  - 48,293 stalled
  - 34,162 queued

R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.


Incident: Activity Blocked

• A task is late compared to the others
• Possible causes
  - Longer waiting times
  - Lost tasks (e.g. killed by site due to quota violation)
  - Resources with poor performance

[Figure: job flow of a real simulation (FIELD-II/pasa, workflow-9SIeNv)]
[Figure: task completion rate of a real simulation, completed jobs vs time (s), showing the long-tail effect]

Activity Blocked: State of the Art

• Task replication
  - Is commonly used to address non-clairvoyant problems
  - Drawback: may overload the system and degrade fairness
• Task replication in the literature
  - Is used to increase the probability to complete a task [Ramakrishnan et al., 2009]
  - Use of the Weibull distribution to estimate the number of replicas [Litke et al., 2007]
  - Tasks are replicated only in the tail phase [Ben-Yehuda et al., 2012]
  - Evaluation of the waste of resources by using replication [Cirne et al., 2007]

All approaches make strong assumptions on task or resource characteristics

Activity Blocked: Degree

• Degree computed from all completed tasks of the activity
• Task phases: setup → inputs download → execution → outputs upload
• Assumption: bag of tasks (all tasks have equal durations)
• Median-based estimation (example for a task currently in the execution phase):
  - setup: median 50s, real 42s (completed), estimated 42s
  - inputs download: median 250s, real 300s (completed), estimated 300s
  - execution: median 400s, real 20s (current), estimated 400s*
  - outputs upload: median 15s, real ?, estimated 15s
  - Median duration of task phases: Mi = 715s
  - Estimated task duration: Ei = 757s
  - *: max(400s, 20s) = 400s
• Incident degree: task performance w.r.t. the median
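The median-based estimation above can be reproduced with a short sketch. The estimation rule follows the slide's worked example (real durations for completed phases, the maximum of median and elapsed time for the current phase, medians for the remaining ones). The normalization `E / (M + E)` in `blocked_degree` is an assumption made here to obtain a value in [0, 1]; the slide only says that the degree measures task performance with respect to the median.

```python
PHASES = ("setup", "inputs_download", "execution", "outputs_upload")


def estimate_duration(medians, observed, current_phase):
    """E_i: real durations for completed phases, max(median, elapsed) for the
    current phase, and medians for the phases not started yet."""
    current = PHASES.index(current_phase)
    total = 0
    for k, phase in enumerate(PHASES):
        if k < current:
            total += observed[phase]
        elif k == current:
            total += max(medians[phase], observed[phase])
        else:
            total += medians[phase]
    return total


def blocked_degree(m_i, e_i):
    # ASSUMPTION: this normalization is illustrative, not taken from the slide.
    return e_i / (m_i + e_i)


medians = {"setup": 50, "inputs_download": 250, "execution": 400, "outputs_upload": 15}
observed = {"setup": 42, "inputs_download": 300, "execution": 20}

m_i = sum(medians.values())
e_i = estimate_duration(medians, observed, "execution")
print(m_i, e_i)  # 715 757
print(blocked_degree(m_i, e_i))
```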

Activity blocked: levels and actions

• Levels: identified from the platform logs extracted from VIP on EGI
  - Level 1 (ηb ≤ τb): no actions
  - Level 2 (ηb > τb): action: replicate tasks
• Actions
  - Task replication
  - Cancel replicas with bad performance
  - Replicate only if all active replicas are running

[Figure: histogram of the activity blocked degree ηb (0.00 to 1.00) vs frequency, split at the threshold τb into Level 1 and Level 2]
[Figure: replication process for one task]
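The replication rule for blocked activities can be sketched as a small predicate. The state names (`"queued"`, `"running"`) and the threshold value used in the example are hypothetical; the rule itself (replicate at level 2, and only if all active replicas are already running) is the one listed in the actions.

```python
def should_replicate(degree, threshold, replica_states):
    """Replicate a task only at level 2 (degree above the threshold) and only
    if every active replica of the task is already running."""
    active = [s for s in replica_states if s in ("queued", "running")]
    return degree > threshold and all(s == "running" for s in active)


TAU_B = 0.5  # illustrative threshold, not the value used in VIP

print(should_replicate(0.9, TAU_B, ["running"]))            # True
print(should_replicate(0.9, TAU_B, ["running", "queued"]))  # False: a replica is still queued
print(should_replicate(0.3, TAU_B, ["running"]))            # False: level 1, no actions
```

Waiting until all replicas are running avoids piling up queued copies of the same task, which limits the overload that replication can cause.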

Activity Blocked: Results

• Goal: Self-Healing vs No-Healing
  - Cope with recoverable errors

[Figure: makespan (s) over 5 repetitions, No-Healing vs Self-Healing; panels: Mean-Shift/hs3, FIELD-II/pasa]

Average execution speed-up: 3.4, 2.9

Resource waste:

$$w = \frac{(CPU + data)_{self\text{-}healing}}{(CPU + data)_{no\text{-}healing}} - 1$$

Self-Healing process reduced resource consumption up to 35% when compared to the No-Healing execution
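The resource waste metric w compares the CPU and data transfer time consumed with and without self-healing; a negative value means that self-healing consumed fewer resources. A minimal sketch with made-up numbers (the four input values below are illustrative, not measurements):

```python
def resource_waste(cpu_sh, data_sh, cpu_nh, data_nh):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1"""
    return (cpu_sh + data_sh) / (cpu_nh + data_nh) - 1.0


# Illustrative values in seconds.
w = resource_waste(cpu_sh=500, data_sh=150, cpu_nh=800, data_nh=200)
print(w)  # about -0.35, i.e. 35% fewer resources than No-Healing
```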

Number of Completed Tasks

[Figure: CDF of the number of completed tasks vs time (min), No-Healing vs Self-Healing, Repetitions 1 to 5]

Curve similarities up to 95% indicate similar grid conditions

Activity Blocked: Conclusions

• First results in controlling blocked activities in these conditions
  - Conditions: production system, non-clairvoyant, online
• Limitation
  - The method only works for bags of tasks
  - The waste metric does not consider resource performance
• Currently used in production by VIP
  - From Aug 2012 to Oct 2013, more than 6000 workflow executions benefited
• Publications

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workflow incidents on distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012.


Incident: Fineness Control

• Low performance of lightweight (a.k.a. fine-grained) tasks:
  - High queuing times
  - Communication overhead

[Figure: Gantt chart of lightweight tasks t1 to t5 on resources R1 to R3 over time; lightweight task executions are delayed]

Grouping into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time

Fineness Control: State of the Art

• Task grouping in the literature
  - Groups tasks based on the granularity size (processing time) [Muthuvelu et al., 2005]
  - Adds bandwidth to the definition of the granularity size [Ng et al., 2006], [Ang et al., 2009]
  - Defines the granularity size based on QoS requirements
    - Task file size, CPU time, resource constraints [Muthuvelu et al., 2008]
  - Drawback: only works under stationary load
• Adaptive algorithms (non-stationary load)
  - Monitors information about the current availability and capability of resources [Liu and Liao, 2009], [Muthuvelu et al., 2013]

All approaches make strong assumptions on task or resource characteristics

Fineness Control: Degree

• Task execution
  - Phases: Queued Time → Shared Input Data → Other Input Data → Application Execution
• Incident degree

$$\eta_f = \max_{i \in [1,m]} \{ f_i = d_i \cdot r_i \}$$

  - Quantities taken from the median task phase durations: $\tilde{t}_{shared}$, $t$, $q_j$
  - i = waiting task, n = number of waiting tasks

Fineness control: levels and actions

• Levels
  - Level 1 (ηf ≤ τf): no actions
  - Level 2 (ηf > τf): action: task grouping
• Actions
  - Task grouping
  - Grouped pairwise until ηf ≤ τf or until Q ≤ R (Q: number of waiting tasks, R: number of running tasks)

[Figure: histogram of the fineness control degree ηf (0.0 to 1.0) vs frequency, split at the threshold τf into Level 1 and Level 2]
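The stopping rule of the grouping action can be sketched as a loop that merges queued tasks pairwise until the fineness degree drops to the threshold or the number of queued groups Q is no larger than the number of running tasks R. Merging one pair per iteration and the callback `degree_fn` (which stands in for the computation of the fineness degree) are assumptions of this sketch.

```python
def group_pairwise(queued, running, degree_fn, threshold):
    """Merge queued tasks pairwise until degree <= threshold or Q <= R.

    queued:    list of task identifiers waiting in the queue
    running:   number of running tasks R
    degree_fn: hypothetical callback returning the fineness degree of the
               current grouping
    """
    groups = [[task] for task in queued]
    while len(groups) > running and len(groups) > 1 and degree_fn(groups) > threshold:
        first, second = groups.pop(0), groups.pop(0)
        groups.append(first + second)  # one coarse-grained task replaces two
    return groups


# Illustrative run: the degree stays high, so grouping stops when Q <= R.
always_fine = lambda groups: 1.0
print(group_pairwise(["t1", "t2", "t3", "t4", "t5"], 3, always_fine, 0.5))
# [['t5'], ['t1', 't2'], ['t3', 't4']]
```

Stopping at Q ≤ R keeps at least as many groups as busy resources, so grouping does not reduce parallelism below what the platform currently offers.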

Coarseness control

• Non-stationary load
  - Loss of parallelism
  - Task de-grouping
• Incident degree

$$\eta_c = \frac{R}{Q + R}$$

• Levels: threshold $\tau_c = 0.5$
  - De-group tasks when R > Q

[Figure: Gantt charts on resources R1 to R3: tasks t1 to t5 at time t1; grouped tasks t1, t2+t3, t4+t5 at time t2, showing the loss of parallelism]
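With the threshold set to 0.5, the coarseness degree R / (Q + R) exceeds the threshold exactly when R > Q, which is the de-grouping condition stated on the slide. A minimal sketch (function names are illustrative):

```python
TAU_C = 0.5  # threshold given on the slide


def coarseness_degree(queued, running):
    """eta_c = R / (Q + R), with Q queued and R running tasks."""
    total = queued + running
    return running / total if total else 0.0


def should_degroup(queued, running):
    """De-group when eta_c > tau_c, which is equivalent to R > Q."""
    return coarseness_degree(queued, running) > TAU_C


print(coarseness_degree(1, 3))  # 0.75
print(should_degroup(1, 3))     # True: more running than queued tasks
print(should_degroup(3, 1))     # False
```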

Results: Non-Stationary Load

• Experiment
  - Evaluate the de-grouping control process under non-stationary load
  - Scenarios: resources appear progressively; resources appear suddenly

[Figure: makespan (s) for Runs 1 to 5: Fineness, Fineness-Coarseness, No-Granularity]

• Speeds up executions up to a factor of 1.5 for Fineness, and 2.1 for Fineness-Coarseness
• Fineness is penalized by its lack of adaptation: slowdown of 20%
• The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates they are strongly correlated

Task Granularity: Conclusions

• First results in controlling task granularity in these conditions
  - Conditions: production system, non-clairvoyant, online
• Limitation
  - The method only works for data-intensive workloads
• Future Work
  - Task pre-emption to handle the scenario where resources suddenly appear and all tasks are running
• Publications

R. Ferreira da Silva, T. Glatard, F. Desprez, On-line, non-clairvoyant optimization of workflow activity granularity on grids, Euro-Par, Aachen, 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.


Incident: Unfairness Among Workflow Executions

• Under resource contention, workflows are unequally slowed down by concurrent executions

[Figure: Gantt charts of 3 identical workflows submitted sequentially (t_{i,j} = 10s) on resources R1 to R3 over time]

$$s = \frac{M_{multi}}{M_{own}}$$

where $M_{multi}$ is the makespan with concurrent executions and $M_{own}$ is the makespan without concurrent executions

$$s_1 = \frac{20}{20} = 1.0 \qquad s_2 = \frac{40}{20} = 2.0 \qquad s_3 = \frac{50}{20} = 2.5$$

Identical workflow executions do not experience the same slowdown
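The slowdown values of the three identical workflows can be checked with a few lines. The makespans (20, 40 and 50 s with concurrent executions, 20 s alone) are the ones from the example; the spread of the slowdowns is computed here with the population standard deviation, which is one possible choice.

```python
from statistics import pstdev


def slowdown(m_multi, m_own):
    """s = M_multi / M_own"""
    return m_multi / m_own


m_own = 20                            # makespan without concurrent executions (s)
m_multi = {1: 20, 2: 40, 3: 50}       # makespans with concurrent executions (s)

slowdowns = {wf: slowdown(m, m_own) for wf, m in m_multi.items()}
print(slowdowns)                      # {1: 1.0, 2: 2.0, 3: 2.5}
print(pstdev(slowdowns.values()))     # spread of the slowdowns; 0 would mean a fair execution
```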

Fairness: State of the Art

• Workflow execution fairness in the literature
  - Addresses fairness based on the slowdown of DAGs based on execution and data transfer times [Zhao and Sakellariou, 2006], [Casanova et al., 2010]
  - Proposes a mapping procedure to increase fairness based on the critical path length [N'Takpe and Suter, 2009]
  - Online, but clairvoyant, HEFT-like algorithms [Hsu et al., 2011], [Sommerfield and Richter, 2011], [Arabnejad and Barbosa, 2012]
  - Non-clairvoyant, but offline, scheduling strategy based on task labeling and adaptive allocation [Hirales-Carbajal et al., 2012]

No algorithm was proposed in a non-clairvoyant and online case

Fairness Control: Degree

• Unfairness degree: max difference between the fractions of pending work

$$\eta_u = W_{max} - W_{min}$$

where:

$$W_i = \max_{j \in [1, n_i]} \left( \frac{Q_{i,j}}{Q_{i,j} + R_{i,j} \cdot P_{i,j}} \cdot T_{i,j} \right)$$

  - i = activity, n_i = active activities
  - Q_{i,j} = number of waiting tasks
  - R_{i,j} = number of running tasks
  - P_{i,j} = performance
  - T_{i,j} = relative observed duration
  - Based on median task phase durations

A low P_{i,j} indicates that resources allocated to the activity have bad performance for the activity
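The unfairness degree can be sketched as follows. The sketch assumes that i indexes the concurrent workflow executions and j the active activities of each execution, and it takes Q, R, P and T as given inputs, since their computation from median task phase durations is not detailed here. The numeric values are illustrative.

```python
def pending_work_fraction(activities):
    """W_i = max_j ( Q / (Q + R * P) * T ) over the active activities.

    Each activity is a dict with keys Q (waiting tasks), R (running tasks),
    P (performance) and T (relative observed duration).
    """
    values = []
    for a in activities:
        denom = a["Q"] + a["R"] * a["P"]
        values.append(a["Q"] / denom * a["T"] if denom else 0.0)
    return max(values) if values else 0.0


def unfairness_degree(executions):
    """eta_u = W_max - W_min over all concurrent executions."""
    fractions = [pending_work_fraction(acts) for acts in executions]
    return max(fractions) - min(fractions)


# Illustrative values: the first execution has nothing waiting, the second has 6 of 8 tasks waiting.
wf1 = [{"Q": 0, "R": 4, "P": 1.0, "T": 1.0}]
wf2 = [{"Q": 6, "R": 2, "P": 1.0, "T": 1.0}]
print(unfairness_degree([wf1, wf2]))  # 0.75
```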


Fairness Control: Levels and Actions

• Levels
  - Level 1 (ηu ≤ τu): no actions
  - Level 2 (ηu > τu): action: task prioritization
• Actions
  - Task prioritization
  - Task priority is an integer initialized to 1
  - Increase priority of Δ_{i,j} tasks

[Figure: histogram of the fairness control degree ηu (0.0 to 1.0) vs frequency, split at the threshold τu into Level 1 and Level 2]

Fairness Control: Metrics

• Unfairness
  - Is the area under the curve ηu during the execution:

$$\mu = \sum_{i=2}^{M} \eta_u(t_i) \cdot (t_i - t_{i-1})$$

  - This metric measures if the fairness process can indeed minimize its own criterion ηu
• Slowdown

$$s = \frac{M_{multi}}{M_{own}}$$

where:

$$M_{own} = \max_{p \in \Omega} \sum_{u \in p} t_u$$
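Both metrics are simple to compute from sampled data. The sketch below assumes that ηu is sampled at times t_1 to t_M, that Ω is the set of paths of the workflow, and that t_u is the duration of task u; these interpretations, the function names and the numeric values are assumptions of this example.

```python
def unfairness_area(times, degrees):
    """mu = sum_{i=2}^{M} eta_u(t_i) * (t_i - t_{i-1})"""
    return sum(degrees[i] * (times[i] - times[i - 1]) for i in range(1, len(times)))


def own_makespan(paths, durations):
    """M_own = max over paths p of the sum of task durations t_u, u in p."""
    return max(sum(durations[u] for u in path) for path in paths)


# Illustrative samples of the unfairness degree over time (s).
times = [0, 10, 20, 30]
degrees = [0.0, 0.5, 0.25, 0.0]
print(unfairness_area(times, degrees))  # 7.5

# Illustrative workflow with two paths sharing task "a".
paths = [["a", "b"], ["a", "c"]]
durations = {"a": 10, "b": 5, "c": 20}
print(own_makespan(paths, durations))   # 30
```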

Results: identical workflows

• Tests whether unfairness among identical workflows is properly addressed

[Figure: unfairness degree over time (s), Fairness vs No-Fairness, Repetitions 1 to 4]
[Figure: makespan (s) of Gate 1, Gate 2 and Gate 3, Fairness vs No-Fairness, for each repetition]

Makespans and unfairness degree values are significantly reduced

Results: different workflows

• Tests whether unfairness among different workflows is detected and properly handled

[Figure: slowdown (log scale) of FIELD-II, Gate, PET-Sorteo and SimuBloch, Fairness vs No-Fairness]
[Figure: unfairness degree over time (s), Fairness vs No-Fairness]

Slowdown standard deviation reduced by up to a factor of 3.8, and unfairness value by up to a factor of 1.9

Fairness Control: Conclusions

• First results in controlling fairness among workflow executions in these conditions
  - Conditions: production system, non-clairvoyant, online
• Limitation
  - Fairness optimization is delayed due to the acquisition of information about the applications
  - The method works best for applications with a lot of short tasks
• Future Work
  - Evaluation of the influence of the metrics' parameters
• Publications

R. Ferreira da Silva, T. Glatard, F. Desprez, Workflow fairness control on online and non-clairvoyant distributed computing platforms, Euro-Par, Aachen, 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.


Contributions Summary

Self-healing of workflow incidents
  - Generic MAPE-K loop
  - Non-clairvoyance and online
  [Ferreira da Silva et al., CCGRID'12, FGCS'13]

Treatment of blocked activities
  - Properly detects and handles blocked activities

Optimization of task granularity
  - Properly detects and handles lightweight tasks under stationary and non-stationary loads
  [Ferreira da Silva et al., Euro-Par'13a]

Fairness control among workflow executions
  - Properly detects and handles unfairness among workflow executions
  [Ferreira da Silva et al., Euro-Par'13b, CPE'14]

Science-gateway model for workload archive
  - Illustration by using traces of the VIP from 2011/2012
  [Ferreira da Silva and Glatard, CGWS'12]

All methods were evaluated on VIP
  - Production platform with about 500 users
  [Ferreira da Silva et al., HealthGrid'11; Glatard et al., TMI'13]

Perspectives

• Mode detection automation
  - Automatically detect variation on threshold values
• Time-windowed historical information
  - User's behavior may change
  - Errors may be restricted to a specific time span
• Optimization of the incident selection method
  - There is no mechanism to prevent an incident from being selected repeatedly
• Sensitivity analysis of parameters
  - Evaluate the influence of parameters on the metrics
• Workflow workload archive
  - The science-gateway workload archive model does not embrace all characteristics inherent to a workflow execution

Thank you for your attention. Questions?

http://vip.creatis.insa-lyon.fr
