07/02/09
Application of Management Frameworks to Manage Workflow-based Systems: A Case Study on a Large Scale E-Science Project
Srinath Perera, Suresh Marru, Thilina Gunarathne, Dennis Gannon, Beth Plale
Indiana University, Bloomington
SOA => Many Service Systems
• SOA leads to systems composed of many services.
• Good: it is distributed, loosely coupled, etc.
• Bad: not very easy to manage, especially if it is distributed across many machines.
• Ugly: a system management/administration nightmare.
• So with many-service systems, most of them reasonably large scale, systems management has become as important as ever!
I have a System Management framework, am I Done?
Applying system management is not simple. Some problems:
• Building a generic framework for actions and monitoring agents.
• Identifying and formulating management scenarios for a given system.
• Handling the lost state of failed managed services; what about lost messages?
• Handling failed management actions, and avoiding loops when a management action fails.
• Notifying other services if a service's location has changed after recovery.
Case Study Based on Large Scale E-Science Project
• Enables scientists to find interesting conditions in weather data collected across the United States, process them using national computational resources (TeraGrid), and manage the weather data, results, and their provenance.
• Built using an SOA-based architecture; has 13+ persistent services and many services created on demand.
Hasthi Management Framework
• Enforces user-defined management logic (expressed as rules) and has a global view of the system.
• Scalable (to manage about 100,1000 services).
• Robust (self-organizing; recovers from failures of both resources and the management framework).
• Dynamic (discovers components and keeps track of resources as they join and leave).
Proposed Integration Model of Hasthi with a Given System
Types of Management Agents
Management Actions
• Action types
– Create a new service
– Restart a running service or recover a failed service
– Relocate a service
– Tune and configure a resource: change the configuration of a resource or change the structure of the system
– User interaction action
• Action implementation
– Use shell scripts (e.g. service start or stop) and execute them using a Host Agent running on each host.
– Use a Hasthi Agent integrated with each resource.
– Hasthi provides default management actions, but users can write their own.
Handling Lost State
• If a service writes its state to a storage location and exposes that location as a parameter, Hasthi passes the location as an argument to the new service.
• Hasthi acts as a service registry and helps services find instances of the services they depend on through a lookup. Services can therefore locate their dependencies via the lookup, both at initialization and after a dependency service has failed.
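The two ideas above (passing the saved state location to the replacement service, and finding dependencies through a lookup) can be sketched roughly as follows. This is only an illustration: `Registry`, `recover_service`, the endpoints, and the state path are hypothetical names chosen for this sketch, not Hasthi's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy stand-in for the lookup role described above (hypothetical API)."""
    # service type -> endpoint of the current live instance
    instances: dict = field(default_factory=dict)

    def register(self, service_type, endpoint):
        self.instances[service_type] = endpoint

    def lookup(self, service_type):
        """Return the current endpoint for a service type, or None."""
        return self.instances.get(service_type)


def recover_service(registry, service_type, new_endpoint, state_location):
    """Start a replacement (simulated) and hand it the old state location."""
    replacement = {"endpoint": new_endpoint, "state_location": state_location}
    registry.register(service_type, new_endpoint)  # dependents see the new address
    return replacement


registry = Registry()
registry.register("catalog", "http://host-a:8080/catalog")

# The instance on host-a fails; a replacement is started on host-b and is
# given the storage location the failed instance had exposed as a parameter.
replacement = recover_service(
    registry, "catalog", "http://host-b:8080/catalog", "/data/catalog-state"
)
current = registry.lookup("catalog")
```

A dependent service that looks up "catalog" after the recovery gets the host-b endpoint without having been told about the move.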
Failed Management Actions
• The resource life cycle avoids loops.
• User interactions delegate fixing the error to human users (send an email to the user; the user responds by clicking a link).
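The slides do not spell out the life cycle states, so the following is only one hypothetical way to picture the idea: a resource carries its own state and attempt count, and once a recovery action has failed too often the resource moves to a state in which no further automatic action is tried and a user is notified instead. The state names and `MAX_ATTEMPTS` are assumptions made for this sketch.

```python
MAX_ATTEMPTS = 3  # hypothetical retry budget, not taken from the source


class ManagedResource:
    def __init__(self, name):
        self.name = name
        self.state = "failed"
        self.attempts = 0


def try_recover(resource, action, notify_user):
    """Apply a recovery action unless the resource has used up its attempts."""
    if resource.state == "needs-user":
        return resource.state              # already escalated; do not retry
    if resource.attempts >= MAX_ATTEMPTS:
        resource.state = "needs-user"      # break the loop and escalate
        notify_user(resource.name)
        return resource.state
    resource.attempts += 1
    resource.state = "healthy" if action(resource) else "failed"
    return resource.state


notified = []
catalog = ManagedResource("catalog")
# A recovery action that always fails, invoked more often than the budget allows.
states = [try_recover(catalog, lambda r: False, notified.append) for _ in range(5)]
```

However often the rule fires afterwards, the action runs at most `MAX_ATTEMPTS` times and the user is notified once.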
False Positives
• Very hard problem; a fact of life in such systems.
• We use heartbeats and timeouts as indicators, and then trigger (pluggable) failure detectors (e.g. active pings, functional tests).
• Timeouts observed by other services can also raise a suspected-fault condition, which activates the custom failure detectors.
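A minimal sketch of the two-stage check described above, assuming a missed heartbeat only makes a resource a suspect and that pluggable detectors then confirm or clear it. `HeartbeatMonitor` and its methods are hypothetical names for this illustration; the clock is injectable so the example does not need to sleep.

```python
import time


class HeartbeatMonitor:
    """Marks resources as suspect on a missed heartbeat, then asks detectors."""

    def __init__(self, timeout_s, detectors, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.detectors = detectors   # callables: name -> True if resource is alive
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, name):
        self.last_seen[name] = self.clock()

    def suspects(self):
        """Resources whose last heartbeat is older than the timeout."""
        now = self.clock()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout_s]

    def confirmed_failures(self):
        """A suspect counts as failed only if no detector reports it alive."""
        return [n for n in self.suspects()
                if not any(detect(n) for detect in self.detectors)]


now = [0.0]  # fake clock value, advanced by hand


def ping(name):
    # Stand-in for an active ping: pretend only "slow-service" still answers.
    return name == "slow-service"


monitor = HeartbeatMonitor(timeout_s=30, detectors=[ping], clock=lambda: now[0])
monitor.heartbeat("slow-service")
monitor.heartbeat("crashed-service")
now[0] = 60.0  # both heartbeats are now overdue
```

Both services become suspects after the timeout, but only "crashed-service" is reported as failed, because the ping detector clears the slow one. With an empty detector list, the timeout alone decides.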
LEAD E-Science Project
• We confirmed the 80-20 rule by analyzing LEAD error data over an 18-month period, where 30/80 (37%) of the different error types were responsible for 95% of all error occurrences.
• LEAD services write data to a database at once, and so LEAD has a best-effort global state.
• Handling errors in LEAD
– Execution errors: handled by multiple levels of retries (e.g. file transfer and job submission retries, rerunning executions on different computational resources), which are part of LEAD.
– Hasthi handles infrastructure errors, and then recovers the workflows that failed because of those errors.
Use Cases as Rules
• A rule has a condition and an action.
• Recover failed services by restarting or moving them (real rules can be complicated).
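To make the condition-plus-action idea concrete, here is a small sketch in plain Python. It does not use Hasthi's rule language; `Rule`, `evaluate`, and the resource dictionaries are hypothetical, and the restart action is a placeholder that only changes a status field.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]   # inspects one resource's state
    action: Callable[[dict], None]      # management action to run on a match


def evaluate(rules, resources):
    """Run each rule's action on every resource that matches its condition."""
    fired = []
    for resource in resources:
        for rule in rules:
            if rule.condition(resource):
                rule.action(resource)
                fired.append((rule.name, resource["name"]))
    return fired


def restart(resource):
    # Placeholder for a real restart or relocate action.
    resource["status"] = "restarting"


rules = [Rule("restart-failed-service",
              condition=lambda r: r["status"] == "failed",
              action=restart)]

resources = [{"name": "catalog", "status": "failed"},
             {"name": "broker", "status": "healthy"}]
fired = evaluate(rules, resources)
```

Only the failed "catalog" service triggers the rule; the healthy "broker" is left alone.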
Rules: Detect Failed System, and Restart Workflows after failure.
Workflow Recovery
Evaluation: LEAD Integration
• Hasthi recovers LEAD from service and host failures and recovers failed workflows.
• We (A) killed a service and (B) killed a host, and measured the time to detect the failure, trigger actions, have new resources join, and detect a healthy condition. It takes about 2 minutes to recover the system and to know that it is healthy.
What Do the Results Mean?
• Assume the MTTF of a service is f and that services are independent. Then the MTTF of the system is f/26 (by Baumann [8], assuming 26 services).
• Using the MTTR from the above results, and assuming Hasthi does not fail, the availability of the system is $A = \frac{f/26}{f/26 + \mathrm{MTTR}}$.
• That is an availability of 0.995, 0.997, and 0.999 for a per-service MTTF of 1 week, 2 weeks, and 1 month, which is 46.8, 26.3, and 8.8 hours of downtime per year.
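The availability figures can be checked with a short calculation. This sketch assumes the usual definition, availability = MTTF / (MTTF + MTTR), a system MTTF of f/26, an MTTR of exactly 2 minutes (the slides say "about 2 minutes"), and a 30-day month; the function names are made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60
MINUTES_PER_MONTH = 30 * 24 * 60   # assumes a 30-day month
HOURS_PER_YEAR = 365 * 24


def system_availability(service_mttf_min, n_services=26, mttr_min=2.0):
    """Availability = MTTF_sys / (MTTF_sys + MTTR), with MTTF_sys = f / n."""
    system_mttf = service_mttf_min / n_services
    return system_mttf / (system_mttf + mttr_min)


def downtime_hours_per_year(availability):
    """Expected downtime per year implied by an availability value."""
    return (1.0 - availability) * HOURS_PER_YEAR


for label, mttf in [("1 week", MINUTES_PER_WEEK),
                    ("2 weeks", 2 * MINUTES_PER_WEEK),
                    ("1 month", MINUTES_PER_MONTH)]:
    print(label, round(system_availability(mttf), 3))
```

Under these assumptions the three availabilities round to 0.995, 0.997, and 0.999. The downtime per year follows from `downtime_hours_per_year` and is sensitive to the exact MTTR and to whether the availability is rounded first.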
Demo (If we have time)
• http://www.extreme.indiana.edu/hasthi/lead/screencasts/hasthi4.htm
Questions