Hasthi Lead Integration: A Case Study on System Management

07/02/09

Application of Management Frameworks to Manage Workflow-based Systems: A Case Study on a Large Scale E-Science Project

Srinath Perera, Suresh Marru, Thilina Gunarathne, Dennis Gannon, Beth Plale

Indiana University, Bloomington


SOA => Many Service Systems

• SOA leads to many service systems.
• Good: they are distributed, loosely coupled, etc.
• Bad: they are not easy to manage, especially when distributed across many machines.
• Ugly: a system management/administration nightmare.
• So with many service systems, most of them reasonably large scale, systems management has become as important as ever!

I have a System Management framework, am I Done?

Applying system management is not simple. Some of the problems:
• Building a generic framework for actions and monitoring agents.
• Identifying and formulating management scenarios for a given system.
• Handling the state lost in failed managed services. What about lost messages?
• Handling failed management actions, and avoiding loops when a management action fails.
• Notifying other services if a service's location has changed after recovery.

Case Study Based on Large Scale E-Science Project

• Enables scientists to find interesting conditions in weather data collected across the United States, process them using national computation resources (TeraGrid), and manage the weather data, results, and their provenance.

• Built on an SOA-based architecture, with 13+ persistent services and many services created on demand.


Hasthi Management Framework

• Enforces user-defined management logic (expressed as rules), and has a global view of the system.

• Scalable (to manage about 100,1000 services).
• Robust (self-organizing; recovers from failures of both resources and the management framework).
• Dynamic (discovers components and keeps track of resources as they join and leave).

Proposed Integration Model of Hasthi with a Given System


Types of Management Agents

Management Actions

• Action types:
– Create a new service.
– Restart a running service or recover a failed service.
– Relocate a service.
– Tune and configure a resource: change the configuration of a resource or change the structure of the system.
– User interaction action.
• Action implementation:
– Use shell scripts (e.g. service start or stop) and execute them using a Host Agent running on each host.
– Use a Hasthi Agent integrated with each resource.
– Hasthi provides default management actions, but users can write their own.

Handling Lost State

• If a service writes its state to a storage location and exposes that location as a parameter, Hasthi passes the location as an argument to the new service.

• Hasthi acts as a service registry and helps services find instances of the services they depend on through a lookup. Services can therefore rediscover a dependency via the lookup at initialization or after the dependency has failed.
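
The two ideas on this slide, passing a persisted state location to the replacement service and resolving dependencies through a registry lookup, can be sketched as follows. This is a minimal in-memory illustration only: `ServiceRegistry`, `Entry`, `recover`, and the example endpoints and paths are hypothetical names chosen for the sketch, not Hasthi's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Entry:
    endpoint: str
    state_location: Optional[str]  # where the service persists its state, if exposed


class ServiceRegistry:
    """Hypothetical in-memory registry; not Hasthi's actual API."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def register(self, name: str, endpoint: str,
                 state_location: Optional[str] = None) -> None:
        self._entries[name] = Entry(endpoint, state_location)

    def lookup(self, name: str) -> str:
        """Return the current endpoint of a dependency service."""
        return self._entries[name].endpoint

    def recover(self, name: str, new_endpoint: str,
                start: Callable[[str, Optional[str]], None]) -> None:
        """Start a replacement instance, handing it the old state location."""
        old = self._entries[name]
        start(new_endpoint, old.state_location)
        # Later lookups by dependent services now resolve to the new instance.
        self._entries[name] = Entry(new_endpoint, old.state_location)
```

A dependent service that calls `lookup("catalog")` after `recover(...)` would get the new endpoint without being notified explicitly, which is one way to address the "service location has changed" problem raised earlier.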

Failed Management Actions

• The resource life cycle avoids loops.

• User interactions delegate fixing the error to human users (Hasthi sends an email to the user, and the user responds by clicking a link).
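
One way a resource life cycle can prevent repair loops is to make "repair failed" a distinct state that automatic actions never pick up again. The sketch below assumes such a design; the state names, `ManagedResource`, and `notify_user` are hypothetical and are not taken from Hasthi itself.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    HEALTHY = "healthy"
    FAILED = "failed"
    REPAIRING = "repairing"
    REPAIR_FAILED = "repair_failed"  # stays here until a human intervenes


class ManagedResource:
    """Hypothetical life cycle; the real Hasthi states may differ."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.HEALTHY

    def mark_failed(self) -> None:
        # Only a healthy resource can newly fail; REPAIR_FAILED is left untouched.
        if self.state is State.HEALTHY:
            self.state = State.FAILED

    def try_repair(self, action: Callable[[str], bool],
                   notify_user: Callable[[str], None]) -> bool:
        # Only FAILED resources are eligible, so a resource whose repair
        # already failed is never retried automatically (no loop).
        if self.state is not State.FAILED:
            return False
        self.state = State.REPAIRING
        try:
            ok = action(self.name)
        except Exception:
            ok = False
        if ok:
            self.state = State.HEALTHY
        else:
            self.state = State.REPAIR_FAILED
            notify_user(self.name)  # delegate to a human, e.g. by email
        return ok
```

In this sketch the user interaction is the only path out of `REPAIR_FAILED`, which mirrors the slide's point that unresolved errors are handed to a person rather than retried forever.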

False Positives

• A very hard problem, and a fact of life in distributed systems.

• We use heartbeats and timeouts as indicators, and trigger (pluggable) failure detectors (e.g. active pings, functional tests).
• Timeouts observed by other services can raise a suspected-fault condition, which activates custom failure detectors.
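
The two-stage approach above (a missed heartbeat only raises suspicion, and pluggable detectors then confirm or reject it) can be sketched as follows. `HeartbeatMonitor` and the detector signature are hypothetical, and the policy that a suspect counts as failed only when no detector reports it healthy is an assumption made for this sketch.

```python
from typing import Callable, Dict, List

# A detector returns True if the service looks healthy (e.g. an active
# ping or a functional test). Hypothetical signature, not Hasthi's API.
Detector = Callable[[str], bool]


class HeartbeatMonitor:
    def __init__(self, timeout_s: float, detectors: List[Detector]) -> None:
        self.timeout_s = timeout_s
        self.detectors = detectors
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, service: str, now: float) -> None:
        """Record the time of the latest heartbeat from a service."""
        self.last_seen[service] = now

    def suspects(self, now: float) -> List[str]:
        """Services whose last heartbeat is older than the timeout."""
        return [s for s, seen in self.last_seen.items()
                if now - seen > self.timeout_s]

    def failed(self, now: float) -> List[str]:
        """Suspects that no detector reports as healthy.

        With no detectors configured, every suspect is treated as failed.
        """
        return [s for s in self.suspects(now)
                if not any(detect(s) for detect in self.detectors)]
```

Running the detectors only on suspects keeps the expensive checks (pings, functional tests) off the common path, while still giving a slow but alive service a chance to avoid being restarted.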

LEAD E-Science Project
• We confirmed the 80-20 rule by analyzing LEAD error data over an 18-month period, in which 30 of 80 (about 37%) different error types were responsible for 95% of all error occurrences.
• LEAD services write data to a database at once, and have a best-effort global state (explain).
• Handling errors in LEAD:
– Execution errors are handled by multiple levels of retries (e.g. file transfer and job submission retries, and running executions on different computational resources, which are part of LEAD).
– Hasthi handles infrastructure errors, and then recovers workflows that failed due to those errors.

Use Cases as Rules

• A rule is a condition and an action.
• Recover failed services by restarting or moving them (real rules can be complicated).
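
The condition-plus-action structure of a rule can be illustrated with a small sketch. This is plain Python rather than Hasthi's actual rule language, and `Resource`, `Rule`, `restart`, and `evaluate` are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Resource:
    name: str
    state: str  # e.g. "healthy" or "failed"
    restarts: int = 0


@dataclass
class Rule:
    name: str
    condition: Callable[[Resource], bool]  # when does the rule fire?
    action: Callable[[Resource], None]     # what does it do?


def restart(resource: Resource) -> None:
    """Stand-in for a real restart action; just updates the model."""
    resource.restarts += 1
    resource.state = "healthy"


def evaluate(rules: List[Rule], resources: List[Resource]) -> List[str]:
    """Fire every rule whose condition holds; return what fired."""
    fired = []
    for resource in resources:
        for rule in rules:
            if rule.condition(resource):
                rule.action(resource)
                fired.append(f"{rule.name}:{resource.name}")
    return fired


rules = [Rule("restart-failed", lambda r: r.state == "failed", restart)]
```

Calling `evaluate(rules, resources)` with one failed resource restarts only that resource. A real rule would typically also check things such as host health before choosing between restarting in place and moving the service.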

Rules: Detect Failed System, and Restart Workflows after failure.

Workflow Recovery

Evaluation: LEAD Integration

• Hasthi recovers LEAD from service and host failures, and recovers failed workflows.

• We (A) killed a service and (B) killed a host, and measured the time to detect the failure, trigger actions, have new resources join, and detect healthy conditions. It takes about 2 minutes to recover the system and to know it is healthy.

What Do the Results Mean?

• Assume the MTTF of a service is f, and that services fail independently. Then the MTTF of the system is f/26 (following Baumann [8], assuming 26 services).
• Using the MTTR from the results above, and assuming Hasthi does not fail, the availability of the system is MTTF / (MTTF + MTTR), with a system MTTF of f/26.
• That is an availability of 0.995, 0.997, and 0.999 with a per-service MTTF of 1 week, 2 weeks, and 1 month respectively, which is 46.8, 26.3, and 8.8 hours of downtime per year.
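
The availability figures can be reproduced with a short calculation. The sketch below assumes the standard steady-state definition, availability = MTTF / (MTTF + MTTR), an MTTR of 2 minutes (taken from the evaluation slide), 26 independent services, and a month of 30 days; the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 365 * 24


def system_availability(service_mttf_min: float,
                        n_services: int = 26,
                        mttr_min: float = 2.0) -> float:
    """Availability of a system of independent services.

    The system MTTF is the per-service MTTF divided by the number of
    services; availability is MTTF / (MTTF + MTTR).
    """
    system_mttf = service_mttf_min / n_services
    return system_mttf / (system_mttf + mttr_min)


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year implied by a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Per-service MTTF of 1 week, 2 weeks, and 1 month (assumed 30 days).
for days in (7, 14, 30):
    a = system_availability(days * MINUTES_PER_DAY)
    print(days, round(a, 3))
```

Under these assumptions the three cases round to 0.995, 0.997, and 0.999, matching the availabilities on the slide.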

Demo (If we have time)

• http://www.extreme.indiana.edu/hasthi/lead/screencasts/hasthi4.htm

Questions
