PhD thesis presented on November 29th, 2013 at INSA-Lyon.

Abstract: Science gateways, such as the Virtual Imaging Platform (VIP), enable transparent access to distributed computing and storage resources for scientific computations. However, their large scale and the number of middleware systems involved lead to many errors and faults. In practice, science gateways are often backed by substantial support staff who monitor running experiments by performing simple yet crucial actions such as rescheduling tasks, restarting services, killing misbehaving runs or replicating data files to reliable storage facilities. Fair quality of service (QoS) can then be delivered, but only with significant human intervention.

Automating such operations is challenging for two reasons. First, the problem is online by nature because no reliable user activity prediction can be assumed, and new workloads may arrive at any time. Therefore, the considered metrics, decisions and actions have to remain simple and yield results while the application is still executing. Second, it is non-clairvoyant due to the lack of information about applications and resources in production conditions. Computing resources are usually dynamically provisioned from heterogeneous clusters, clouds or desktop grids without any reliable estimate of their availability and characteristics. Models of application execution times are hardly available either, in particular on heterogeneous computing resources.

In this thesis, we propose a general healing process for autonomous detection and handling of operational incidents in workflow executions. Instances are modeled as Fuzzy Finite State Machines (FuSM) where state degrees of membership are determined by an external healing process. Degrees of membership are computed from metrics assuming that incidents have outlier performance, e.g. a site or a particular invocation behaves differently than the others. Based on incident degrees, the healing process identifies incident levels using thresholds determined from the platform history. A specific set of actions is then selected from association rules among incident levels.

For more information visit http://www.rafaelsilva.com
A science-gateway for workflow executions: online and non-clairvoyant self-healing
of workflow executions on grids
Rafael FERREIRA DA SILVA University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
Supervisors: Frédéric DESPREZ and Tristan GLATARD
This work was funded by the French National Agency for Research under grant ANR-09-COSI-03 "VIP"
Outline
• Technical context and challenges
• Contributions
  - Self-healing of workflow executions on grids
  - Treatment of blocked activities
  - Optimization of task granularity
  - Fairness control among workflow executions
• Conclusions
Heavy Medical Simulations
• Treatment planning for prostate proton therapy [L. Grevillot, D. Sarrut]. CPU time: 2 months
• Simulated diffusion-weighted images [L. Wang, Y. Zhu, I. Magnin]. CPU time: 8 years
• Echography simulation [O. Bernard, M. Alessandrini]. CPU time: 42 hours
Virtual Imaging Platform
Public Computing Infrastructure 150 computing sites world-wide
Medical-Imaging Execution Platform 491 users from 52 countries
Goal: Self-healing of workflow executions on grids to handle operational issues
Workflow Execution
1. Input data upload
2. User launches a simulation (application workflow)
3. Workflow engine generates invocations
4. Invocations are wrapped into grid jobs
5. Jobs are submitted to a Pilot Engine
6. Pilot jobs are submitted to the distributed infrastructure
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results
Science-Gateway: Virtual Imaging Platform (VIP)
High-level interface, Software-as-a-Service
Workflow Management System: applications described as workflows, parallel language, grid-aware enactor
Workload Management System: pilot jobs run special agents that fetch user tasks from the task queue, set up their environment and steer their execution
European Grid Infrastructure (EGI): more than 100 computing sites, more than 25,000 job slots, about 4 PB of storage
Challenges
• Several workflow execution errors
• Several dysfunctions and performance problems
  - Require manual interventions
• Problem: costly manual operations
  - e.g. rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files
[Figure: number of launched and completed workflows in VIP from January to December 2012]
Average workflow completion rate is about 60%
Objectives
• Objective: automated platform administration
  - Autonomous detection of operational incidents
  - Perform an appropriate set of actions
• Assumptions: online and non-clairvoyant
  - Decisions must be fast
  - No information about tasks (duration, data transfer time, etc.)
  - No information about resources (availability, performance, etc.)
  - No prediction of user activity or workloads
State of the Art
• Self-healing of workflow executions
  - Most works from the literature are offline and/or clairvoyant
• Common techniques to address operational incidents
  - Task resubmission: [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010]
  - Task and file replication: [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013]
  - Task grouping: [Muthuvelu et al., 2005-2013], [Liu and Liao, 2009], [Chen et al., 2013]
  - Heuristics to fairly schedule workflow tasks: [Zhao and Sakellariou, 2006], [N’Takpe and Suter, 2009], [Casanova et al., 2010]
Fuzzy Finite State Machine
• The healing process sets the degree of FuSM states from incident detection metrics
• Crisp states: possible values are 0 or 1
• Fuzzy states: values between 0 and 1
General MAPE-K loop
• Loop phases: Monitoring, Analysis, Planning, Execution, with shared Knowledge
• Triggered by an event (job completion and failures) or a timeout; Monitoring provides the monitoring data
• Each incident has a degree η, mapped to a level (level 1, level 2, level 3)
  - Incident 1: degree η = 0.8
  - Incident 2: degree η = 0.4
  - Incident 3: degree η = 0.1
• Roulette wheel selection: incident i is selected with probability $\eta_i / \sum_{j=1}^{n} \eta_j$ (in the example, Incident 1 is selected)
• Association rules for incident 1:

  Rule     Confidence (ρ)   ρ × η
  2 → 1    0.8              0.32
  3 → 1    0.2              0.02
  1 → 1    1.0              0.80

• Roulette wheel selection based on association rules (in the example, Incident 2 is selected)
• Outcome: a set of actions
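The two selection steps of this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, the degrees and rule confidences are the example values shown on the slide, and the second draw is assumed to weight each rule by ρ × η as in the table.

```python
import random


def selection_probabilities(weights):
    """Normalize weights so that they sum to 1 (p_i = w_i / sum_j w_j)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {key: w / total for key, w in weights.items()}


def roulette_wheel(weights, rng=random):
    """Pick a key at random, with probability proportional to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    pick = rng.random() * total  # uniform in [0, total)
    cumulative = 0.0
    for key, w in weights.items():
        cumulative += w
        if pick < cumulative:
            return key
    return key  # only reached through floating-point round-off


def rule_weights(degrees, confidence):
    """Weight each rule 'x -> target' by rho_x * eta_x (assumed from the table)."""
    return {x: rho * degrees[x] for x, rho in confidence.items()}


# Example values from the slide.
degrees = {1: 0.8, 2: 0.4, 3: 0.1}     # eta of incidents 1, 2 and 3
confidence = {2: 0.8, 3: 0.2, 1: 1.0}  # rho of rules 2->1, 3->1 and 1->1

first_pick = roulette_wheel(degrees)         # draw proportional to eta
weights = rule_weights(degrees, confidence)  # approximately 0.32, 0.02 and 0.80
second_pick = roulette_wheel(weights)        # draw proportional to rho * eta
```

Because both draws are random, a low-degree incident can still be selected occasionally; the slide example, where Incident 2 wins the second draw despite a weight of 0.32 against 0.80, is one such outcome.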
R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
Incident Levels and Actions
• Incident degrees are quantized into discrete incident levels
• Thresholds are determined from mode clustering
  - Thresholds τ cluster platform configurations into groups
  - Below the threshold: no actions are triggered
  - Above the threshold: a set of actions is triggered
[Figure: histogram of incident degree η_u (x-axis: η_u from 0.0 to 1.0; y-axis: frequency)]
A-priori knowledge
• Based on the workload of VIP
  - January 2011 to April 2012
  - 112 users
  - 2,941 workflow executions
  - 339,545 pilot jobs
  - 680,988 tasks: 338,989 completed, 138,480 error, 105,488 aborted, 15,576 aborted replicas, 48,293 stalled, 34,162 queued
R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.
Incident: Activity Blocked
• A task is late compared to the others
• Possible causes
  - Longer waiting times
  - Lost tasks (e.g. killed by site due to quota violation)
  - Resources with poor performance
[Figure: completed jobs vs. time (s) for FIELD-II/pasa, workflow-9SIeNv. Task completion rate of a real simulation, showing the long-tail effect]
[Figure: job flow of a real simulation]
Activity Blocked: State of the Art
• Task replication
  - Is commonly used to address non-clairvoyant problems
  - Drawback: may overload the system and degrade fairness
• Task replication in the literature
  - Used to increase the probability of completing a task [Ramakrishnan et al., 2009]
  - Weibull distribution used to estimate the number of replicas [Litke et al., 2007]
  - Tasks are replicated only in the tail phase [Ben-Yehuda et al., 2012]
  - Evaluation of the resources wasted by replication [Cirne et al., 2007]
• All approaches make strong assumptions on task or resource characteristics
Activity Blocked: Degree
• Degree computed from all completed tasks of the activity
• Task phases: setup → inputs download → execution → outputs upload
• Assumption: bag of tasks (all tasks have equal durations)
• Median-based estimation:

  Phase             Median duration   Real task duration   Estimated task duration
  setup             50s               42s (completed)      42s
  inputs download   250s              300s (completed)     300s
  execution         400s              20s (current)        400s*
  outputs upload    15s               ?                    15s
  Total             Mi = 715s                              Ei = 757s

  *: max(400s, 20s) = 400s
• Incident degree: task performance w.r.t. median
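The median-based estimation in this example can be reproduced with a short Python sketch. The function and variable names are hypothetical; the rule it applies (observed durations for completed phases, the maximum of median and elapsed time for the current phase, medians for the remaining phases) is read off the slide example and is not a statement of the production implementation.

```python
PHASES = ("setup", "inputs_download", "execution", "outputs_upload")


def estimated_duration(medians, observed, current_phase):
    """Estimate the total duration of a task that is still running.

    Completed phases count with their observed duration, the current phase
    with max(median, time elapsed so far), and future phases with the median.
    """
    if current_phase not in PHASES:
        raise ValueError(f"unknown phase: {current_phase}")
    total = 0
    reached_current = False
    for phase in PHASES:
        if phase == current_phase:
            reached_current = True
            total += max(medians[phase], observed[phase])
        elif not reached_current:
            total += observed[phase]  # phase already completed
        else:
            total += medians[phase]   # phase not started yet
    return total


# Values from the slide, in seconds.
medians = {"setup": 50, "inputs_download": 250, "execution": 400, "outputs_upload": 15}
observed = {"setup": 42, "inputs_download": 300, "execution": 20}

M_i = sum(medians.values())                               # 715
E_i = estimated_duration(medians, observed, "execution")  # 757
```

The degree itself then compares E_i with M_i; the exact expression is not given on the slide, so it is left out of this sketch.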
Activity Blocked: Levels and Actions
• Levels: identified from the platform logs extracted from VIP on EGI
  - Level 1: no actions
  - Level 2: replicate tasks
• Actions
  - Task replication
  - Cancel replicas with bad performance
  - Replicate only if all active replicas are running
[Figure: histogram of the activity blocked degree η_b (x-axis: η_b from 0.00 to 1.00; y-axis: frequency), with threshold τ_b separating Level 1 from Level 2]
[Figure: replication process for one task]
Activity Blocked: Results
• Goal: Self-Healing vs No-Healing
  - Cope with recoverable errors
• Resource waste: $w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1$
• The Self-Healing process reduced resource consumption by up to 35% compared to the No-Healing execution
[Figure: makespan (s) of No-Healing and Self-Healing over 5 repetitions. Panels: Mean-Shift/hs3 (average execution speed-up: 3.4) and FIELD-II/pasa (average execution speed-up: 2.9)]
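The resource waste metric is a simple ratio, sketched below in Python. The function name and the input figures are hypothetical; the numbers are made up so that the result matches the 35% reduction quoted on the slide, and are not measurements from the experiments.

```python
def resource_waste(cpu_self, data_self, cpu_none, data_none):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1.

    A negative value means the self-healing run consumed fewer resources
    than the no-healing run.
    """
    baseline = cpu_none + data_none
    if baseline <= 0:
        raise ValueError("the no-healing consumption must be positive")
    return (cpu_self + data_self) / baseline - 1


# Hypothetical consumption figures (e.g. in seconds of CPU and transfer time).
w = resource_waste(cpu_self=600, data_self=50, cpu_none=900, data_none=100)
print(f"w = {w:.2f}")
```

With these inputs the self-healing run uses 650 units against 1000, so w is about -0.35.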
Number of Completed Tasks
[Figure: CDF of the number of completed tasks vs. time (min) for No-Healing and Self-Healing, one panel per repetition (Repetitions 1 to 5)]
Curve similarities up to 95% indicate similar grid conditions
Activity Blocked: Conclusions
• First results in controlling blocked activities in these conditions
  - Conditions: production system, non-clairvoyant, online
• Limitations
  - The method only works for bags of tasks
  - The waste metric does not consider resource performance
• Currently used in production by VIP
  - From Aug 2012 to Oct 2013, more than 6000 workflow executions benefited
• Publications
R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workflow incidents on distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012.
Incident: Fineness Control
• Low performance of lightweight (a.k.a. fine-grained) tasks
  - High queuing times
  - Communication overhead
• Lightweight task executions are delayed
• Grouping into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time
[Figure: timeline of lightweight tasks t1 to t5 on resources R1 to R3]
Fineness Control: State of the Art
• Task grouping in the literature
  - Groups tasks based on the granularity size (processing time) [Muthuvelu et al., 2005]
  - Adds bandwidth to the definition of the granularity size [Ng et al., 2006], [Ang et al., 2009]
  - Defines the granularity size based on QoS requirements: task file size, CPU time, resource constraints [Muthuvelu et al., 2008]
  - Drawback: only works under stationary load
• Adaptive algorithms (non-stationary load)
  - Monitor information about the current availability and capability of resources [Liu and Liao, 2009], [Muthuvelu et al., 2013]
• All approaches make strong assumptions on task or resource characteristics
Fineness Control: Degree
• Task execution phases: queued time, shared input data, other input data, application execution
• Incident degree: $\eta_f = \max_{i \in [1,m]} \{ f_i = d_i \cdot r_i \}$
  - i = waiting task, n = number of waiting tasks
  - Computed from median task phase durations (figure labels: $\tilde{t}_{\text{shared}}$, $t$, $q_j$)
Fineness Control: Levels and Actions
• Levels: identified from the platform logs extracted from VIP on EGI
  - Level 1: no actions
  - Level 2: task grouping
• Actions
  - Task grouping
  - Tasks are grouped pairwise until $\eta_f \le \tau_f$ or until Q ≤ R
[Figure: histogram of the fineness control degree η_f (x-axis: η_f from 0.0 to 1.0; y-axis: frequency), with threshold τ_f separating Level 1 from Level 2]
Coarseness Control
• Non-stationary load
  - Loss of parallelism
  - Task de-grouping
• Incident degree: $\eta_c = \frac{R}{Q + R}$
• Levels: threshold $\tau_c = 0.5$
  - De-group tasks when R > Q
[Figure: timeline of tasks t1, t2+t3 and t4+t5 on resources R1 to R3, showing tasks at time t1, grouped tasks at time t2, and the resulting loss of parallelism]
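The coarseness rule can be checked with a small Python sketch. It assumes that R and Q are the numbers of running and queued tasks, as defined on the fairness slides; the function names and the handling of the empty case are illustrative choices, not taken from the slides.

```python
TAU_C = 0.5  # threshold given on the slide


def coarseness_degree(running, queued):
    """eta_c = R / (Q + R); returns 0.0 when there are no tasks (assumption)."""
    total = running + queued
    if total == 0:
        return 0.0
    return running / total


def should_degroup(running, queued, tau_c=TAU_C):
    """De-group when eta_c exceeds tau_c, i.e. when R > Q for tau_c = 0.5."""
    return coarseness_degree(running, queued) > tau_c


print(should_degroup(running=3, queued=2))  # more running than queued tasks
print(should_degroup(running=2, queued=3))  # more queued than running tasks
```

With τ_c = 0.5, the test η_c > τ_c is algebraically the same as R > Q, which is why the slide can state the action in either form.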
Results: Non-Stationary Load
• Experiment
  - Evaluate the de-grouping control process under non-stationary load
[Figure: makespan (s) of Fineness, Fineness-Coarseness and No-Granularity over Runs 1 to 5. Panels: resources appear progressively; resources appear suddenly]
• Speeds up executions by up to a factor of 1.5 for Fineness and 2.1 for Fineness-Coarseness
• Fineness is penalized by its lack of adaptation: slowdown of 20%
• The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates that they are correlated
Task Granularity: Conclusions
• First results in controlling task granularity in these conditions
  - Conditions: production system, non-clairvoyant, online
• Limitation
  - The method only works for data-intensive workloads
• Future work
  - Task pre-emption to handle the scenario where resources suddenly appear and all tasks are running
• Publications
R. Ferreira da Silva, T. Glatard, F. Desprez, On-line, non-clairvoyant optimization of workflow activity granularity on grids, Euro-Par, Aachen, 2013.
R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.
Incident: Unfairness Among Workflow Executions
• Under resource contention, workflows are unequally slowed down by concurrent executions
• Example: 3 identical workflows submitted sequentially ($t_{i,j} = 10$ s)
[Figure: schedule of tasks t1,1 to t3,5 on resources R1 to R3 over time]
• Slowdown: $s = \frac{M_{\text{multi}}}{M_{\text{own}}}$
  - $M_{\text{multi}}$: makespan with concurrent executions
  - $M_{\text{own}}$: makespan without concurrent executions
  - $s_1 = \frac{20}{20} = 1.0$
  - $s_2 = \frac{40}{20} = 2.0$
  - $s_3 = \frac{50}{20} = 2.5$
• Identical workflow executions do not experience the same slowdown
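The slowdown values of this example can be reproduced with a small simulation. The Python sketch below rests on assumptions that are not stated on the slide: each workflow is treated as 5 independent 10 s tasks, all workflows are queued at time 0 in submission order, and the 3 resources serve the queue first-come, first-served. The function name is hypothetical.

```python
import heapq


def fifo_makespans(n_workflows, tasks_per_workflow, task_duration, n_resources):
    """Completion time of each workflow under first-come, first-served scheduling."""
    free_at = [0] * n_resources  # time at which each resource becomes free
    heapq.heapify(free_at)
    finish = [0] * n_workflows
    for wf in range(n_workflows):
        for _ in range(tasks_per_workflow):
            start = heapq.heappop(free_at)  # earliest available resource
            end = start + task_duration
            heapq.heappush(free_at, end)
            finish[wf] = max(finish[wf], end)
    return finish


# One workflow alone on the platform gives M_own.
m_own = fifo_makespans(1, 5, 10, 3)[0]
# Three concurrent workflows give one M_multi per workflow.
m_multi = fifo_makespans(3, 5, 10, 3)
slowdowns = [m / m_own for m in m_multi]
print(m_own, m_multi, slowdowns)
```

Under these assumptions the sketch yields makespans of 20 s, 40 s and 50 s against an M_own of 20 s, which matches the three slowdowns on the slide.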
Fairness: State of the Art
• Workflow execution fairness in the literature
  - Fairness based on the slowdown of DAGs, computed from execution and data transfer times [Zhao and Sakellariou, 2006], [Casanova et al., 2010]
  - Mapping procedure to increase fairness based on the critical path length [N’Takpe and Suter, 2009]
  - Online, but clairvoyant, HEFT-like algorithms [Hsu et al., 2011], [Sommerfield and Richter, 2011], [Arabnejad and Barbosa, 2012]
  - Non-clairvoyant, but offline, scheduling strategy based on task labeling and adaptive allocation [Hirales-Carbajal et al., 2012]
• No algorithm was proposed for the non-clairvoyant and online case
Fairness Control: Degree
• Unfairness degree: $\eta_u = W_{\max} - W_{\min}$
  - Maximum difference between the fractions of pending work
• where: $W_i = \max_{j \in [1, n_i]} \left\{ \frac{Q_{i,j}}{Q_{i,j} + R_{i,j} \cdot P_{i,j}} \cdot T_{i,j} \right\}$
  - i = activity, n_i = active activities
  - Q_{i,j} = number of waiting tasks
  - R_{i,j} = number of running tasks
  - T_{i,j} = relative observed duration (from median task phase durations)
  - P_{i,j} = performance; a low P_{i,j} indicates that resources allocated to the activity have bad performance for the activity
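The unfairness degree can be sketched in Python as follows. The sketch assumes that i indexes workflow executions and j their active activities, which is one reading of the formula; the function names, the zero-denominator convention and the input values are all hypothetical.

```python
def pending_work(activities):
    """W_i: maximum over activities of Q / (Q + R * P) * T.

    `activities` is a list of (Q, R, P, T) tuples: waiting tasks, running
    tasks, performance and relative observed duration of each activity.
    """
    fractions = []
    for q, r, p, t in activities:
        denominator = q + r * p
        # Convention chosen for this sketch: nothing waiting or running means no pending work.
        fractions.append(0.0 if denominator == 0 else q / denominator * t)
    return max(fractions)


def unfairness_degree(workflows):
    """eta_u = W_max - W_min over all workflow executions."""
    values = [pending_work(acts) for acts in workflows.values()]
    return max(values) - min(values)


# Hypothetical snapshot of two concurrent workflow executions.
snapshot = {
    "workflow_a": [(10, 10, 1.0, 1.0)],                    # W = 0.50
    "workflow_b": [(30, 10, 1.0, 1.0), (0, 5, 1.0, 1.0)],  # W = 0.75
}
eta_u = unfairness_degree(snapshot)
print(eta_u)
```

In this snapshot workflow_b has a larger fraction of pending work than workflow_a, so η_u is positive; identical fractions would give η_u = 0.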
Fairness Control: Levels and Actions
• Levels: identified from the platform logs extracted from VIP on EGI
  - Level 1: no actions
  - Level 2: task prioritization
• Actions
  - Task prioritization
  - Task priority is an integer initialized to 1
  - Increase priority of $\Delta_{i,j}$ tasks
[Figure: histogram of the fairness control degree η_u (x-axis: η_u from 0.0 to 1.0; y-axis: frequency), with threshold τ_u separating Level 1 from Level 2]
Fairness Control: Metrics
• Unfairness: the area under the curve η_u during the execution
  - $\mu = \sum_{i=2}^{M} \eta_u(t_i) \cdot (t_i - t_{i-1})$
  - This metric measures whether the fairness process can indeed minimize its own criterion η_u
• Slowdown: $s = \frac{M_{\text{multi}}}{M_{\text{own}}}$
  - where: $M_{\text{own}} = \max_{p \in \Omega} \sum_{u \in p} t_u$
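The unfairness metric μ is a right-endpoint sum over the sampled values of η_u, which a few lines of Python can illustrate. The function name and the sample series are hypothetical; the samples are assumed to be sorted by time.

```python
def unfairness_area(samples):
    """mu = sum over i >= 2 of eta_u(t_i) * (t_i - t_{i-1}).

    `samples` is a list of (t_i, eta_u(t_i)) pairs sorted by increasing time.
    """
    return sum(
        eta * (t - t_prev)
        for (t_prev, _), (t, eta) in zip(samples, samples[1:])
    )


# Hypothetical measurements: (time in seconds, unfairness degree).
samples = [(0, 0.0), (10, 0.5), (30, 0.2)]
mu = unfairness_area(samples)  # 0.5 * 10 + 0.2 * 20
print(mu)
```

A run in which η_u stays high for a long time accumulates a larger μ than one where it is brought down quickly, which is what makes the metric suitable for comparing the Fairness and No-Fairness executions.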
Results: Identical Workflows
• Tests whether unfairness among identical workflows is properly addressed
[Figure: degree (axis labeled η_f) vs. time (s) for Fairness and No-Fairness, Repetitions 1 to 4]
[Figure: makespan (s) of Gate 1, Gate 2 and Gate 3 for Fairness and No-Fairness, Repetitions 1 to 4]
• Makespans and unfairness degree values are significantly reduced
Results: Different Workflows
• Tests whether unfairness among different workflows is detected and properly handled
[Figure: slowdown (log scale) of FIELD-II, Gate, PET-Sorteo and SimuBloch for Fairness and No-Fairness, Repetitions 1 to 4]
[Figure: degree (axis labeled η_f) vs. time (s) for Fairness and No-Fairness, Repetitions 1 to 4]
• Reduced the slowdown standard deviation by up to a factor of 3.8, and the unfairness value by up to a factor of 1.9
Fairness Control: Conclusions
• First results in controlling fairness among workflow executions in these conditions
  - Conditions: production system, non-clairvoyant, online
• Limitations
  - Fairness optimization is delayed due to the acquisition of information about the applications
  - The method works best for applications with a lot of short tasks
• Future work
  - Evaluation of the influence of the metrics’ parameters
• Publications
R. Ferreira da Silva, T. Glatard, F. Desprez, Workflow fairness control on online and non-clairvoyant distributed computing platforms, Euro-Par, Aachen, 2013.
R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.
Contributions Summary
• Self-healing of workflow incidents
  - Generic MAPE-K loop
  - Non-clairvoyant and online
  [Ferreira da Silva et al., CCGRID’12, FGCS’13]
• Treatment of blocked activities
  - Properly detects and handles blocked activities
• Optimization of task granularity
  - Properly detects and handles lightweight tasks under stationary and non-stationary loads
  [Ferreira da Silva et al., Euro-Par’13a]
• Fairness control among workflow executions
  - Properly detects and handles unfairness among workflow executions
  [Ferreira da Silva et al., Euro-Par’13b, CPE’14]
• Science-gateway model for workload archive
  - Illustration by using traces of the VIP from 2011/2012
  [Ferreira da Silva and Glatard, CGWS’12]
• All methods were evaluated on VIP
  - Production platform with about 500 users
  [Ferreira da Silva et al., HealthGrid’11; Glatard et al., TMI’13]
Perspectives
• Mode detection automation
  - Automatically detect variation in threshold values
• Time-windowed historical information
  - User behavior may change
  - Errors may be restricted to a specific time span
• Optimization of the incident selection method
  - There is no mechanism to prevent an incident from being selected repeatedly
• Sensitivity analysis of parameters
  - Evaluate the influence of parameters on the metrics
• Workflow workload archive
  - The science-gateway workload archive model does not embrace all characteristics inherent to a workflow execution
Thank you for your attention. Questions?
http://vip.creatis.insa-lyon.fr