View
15
Download
1
Category
Preview:
DESCRIPTION
sreeni
Citation preview
INTRODUCTION
Grid and cluster architectures have gained popularity for computationally intensive parallel applications. However, the complexity of the infrastructure, consisting of computational nodes, mass storage, and interconnection networks, poses great challenges with respect to overall system reliability. Simple tools of reliability analysis show that as the complexity of the system increases, its reliability, and thus, Mean Time to Failure (MTTF), decreases. If one models the system as a series reliability block diagram, the reliability of the entire system is computed as the product of the reliabilities of all system components. For applications executing on large clusters or a Grid, e.g., Grid5000, the long execution times may exceed the MTTF of the infrastructure and, thus, render the execution infeasible. As an example, let us consider an execution lasting 10 days in a system that does not consider fault tolerance. Under the optimistic assumption that the MTTF of a single node is 2,000 days, the probability of failure of this long execution using 100, 200, or 500 nodes is 0.39, 0.63, or 0.91, respectively, approaching fast certain failure. The high failure probabilities are due to the fact that, in the absence of fault-tolerance mechanisms, the failure of a single node will cause the entire execution to fail. Note that this simple example does not even consider network failures, which are typically more likely than computer failure. Fault tolerance is, thus, a necessity to avoid failure in large applications, such as found in scientific computing, executing on a Grid, or large cluster.
3. SYSTEM ANALYSIS
Existing System:
Communication Induced Check-pointing protocols usually make the assumption
that any process can be check-pointed at any time. An alternative approach which
releases the constraint of always check- point table processes, without delaying any
do not message reception nor did altering message ordering enforce by the
communication layer or by the application. This protocol has been implemented
within Pro-Active, an open source Java middleware for asynchronous and
distributed objects implementing the ASP (Asynchronous Sequential Processes)
model.
Proposed System:
This paper presents two fault-tolerance mechanisms called Theft-Induced Check
pointing and Systematic Event Logging. These are transparent protocols capable of
overcoming problems associated with both benign faults, i.e., crash faults, and
node or subnet volatility. Specifically, the protocols base the state of the execution
on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous
systems as well as multithreaded applications.
DATA FLOW DIAGRAMA data-flow diagram (DFD) is a graphical representation of the "flow" of
data through an information system. DFDs can also be used for the
visualization of data processing (structured design).
On a DFD, data items flow from an external data source or an internal data
store to an internal data store or an external data sink, via an internal
process.
A DFD provides no information about the timing or ordering of processes,
or about whether processes will operate in sequence or in parallel. It is
therefore quite different from a flowchart, which shows the flow of control
through an algorithm, allowing a reader to determine what operations will be
performed, in what order, and under what circumstances, but not what kinds
of data will be input to and output from the system, nor where the data will
come from and go to, nor where the data will be stored (all of which are
shown on a DFD).
Crash recoveryUse of
protocols
CRASH CLIENT PROCESS STOPPED
LOCAL/FORCED
TEC AND SEL
CENTRALSERVER
Receivedmessages
Dataflow diagram
Architecture Diagram
Use case diagram:
A use case diagram in the Unified Modeling Language (UML) is a type of
behavioral diagram defined by and created from a Use-case analysis. Its
purpose is to present a graphical overview of the functionality provided by a
system in terms of actors, their goals (represented as use cases), and any
dependencies between those use cases.
The main purpose of a use case diagram is to show what system functions
are performed for which actor. Roles of the actors in the system can be
depicted.
Local/Forced
Crash
Transmission
Server
Client
Class Diagram:
Class diagrams are the mainstay of object-oriented analysis and design.
Class diagrams show the classes of the system, their interrelationships (including
inheritance, aggregation, and association), and the operations and attributes of the
classes. Class diagrams are used for a wide variety of purposes, including both
conceptual/domain modeling and detailed design modeling.
Client1MessageCrash
Run()ActionPerformed()GetMessage()
Client2MessageCrash
Run()ActionPerformed()GetMessage()
CrashClient1Clent2
Crash()ActionPerformed()
RecoveryTICSEL
DB()ActionPerformed()
ServerClientCountCrashRecovery
UpdateStatus()Recovery()
1..n
Sequence Diagram:
A sequence diagram in Unified Modeling Language (UML) is a kind of
interaction diagram that shows how processes operate with one another and
in what order. It is a construct of a Message Sequence Chart.
Sequence diagrams are sometimes called Event-trace diagrams, event
scenarios, and timing diagrams.
A sequence diagram shows, as parallel vertical lines (lifelines), different
processes or objects that live simultaneously, and, as horizontal arrows, the
messages exchanged between them, in the order in which they occur. This
allows the specification of simple runtime scenarios in a graphical manner.
Client1 Client2 ServerCrash
SendMsg
Ack
Crash
Recovery
Collaboration Diagram:
Collaboration diagrams belong to a group of UML diagrams called Interaction
Diagrams. Collaboration diagrams, like Sequence Diagrams, show how objects
interact over the course of time. However, instead of showing the sequence of
events by the layout on the diagram, collaboration diagrams show the sequence by
numbering the messages on the diagram. This makes it easier to show how the
objects are linked together, but harder to see the sequence at a glance.
Server
Client1 Client2
Crash
1: SendMsg
2: Ack
3: Crash
4: Recovery
Activity Diagram:
Activity diagrams are typically used for business process modeling, for modeling
the logic captured by a single use case or usage scenario, or for modeling the
detailed logic of a business rule. Although UML activity diagrams could
potentially model the internal logic of a complex operation it would be far better to
simply rewrite the operation so that it is simple enough that you don’t require an
activity diagram. In many ways UML activity diagrams are the object-oriented
equivalent of flow charts and data flow diagrams (DFDs) from structured
development.
Client
Sending
Crash
Local/Forced Transmission
sender
message
Receiver
crash
Local/Forced
State Diagram
Modules:1. Network Module
2. Logging Module
3. Check-pointing Module
4. Work Stealing Module
5. Fault and Fault Free Module
Module Description:
1. Network Module
Client-server computing or networking is a distributed application architecture
that partitions tasks or workloads between service providers (servers) and
service requesters, called clients. Often clients and servers operate over a
computer network on separate hardware. A server machine is a high-
performance host that is running one or more server programs which share its
resources with clients. A client also shares any of its resources; Clients
therefore initiate communication sessions with servers which await (listen to)
incoming requests.
2. Logging Module
Logging can be classified as pessimistic, optimistic, or causal. It is based on the
fact that the execution of a process can be modeled as a sequence of state intervals.
The execution during a state interval is deterministic. However, each state interval
is initiated by a nondeterministic event. Now, assume that the system can capture
and log sufficient information about the nondeterministic events that initiated the
state interval. This is called the piecewise deterministic (PWD) assumption .Then,
a crashed process can be recovered by 1) restoring it to the initial state and 2)
replaying the logged events to it in the same order they appeared in the execution
before the crash. To avoid a rollback to the initial state of a process and to limit the
amount of nondeterministic events that need to be replayed, each process
periodically saves its local state. Log-based mechanisms in which the only
nondeterministic events in a system are the reception of messages is usually
referred to as message logging.
3. Check-pointing Module
Rather than logging events, checkpointing relies on periodically saving the state of
the computation to stable storage. If a fault occurs, the computation is restarted
from one of the previously saved states. Since the computation is distributed, one
has to consider the tradeoff space of local and global checkpointing strategies and
their resulting recovery cost. Thus, checkpointing based methods differ in the way
processes are coordinated and in the derivation of a consistent global state. The
consistent global state can be achieved either at the time of checkpointing or at the
time of rollback recovery. The two approaches are called coordinated and
uncoordinated checkpointing, respectively
4. Work Stealing Module
The runtime environment and primary mechanism for load distribution is based on
a scheduling algorithm called work-stealing .The principle is simple: when a
process becomes idle it tries to steal work from another process called victim. The
initiating process is called thief. Work-stealing is the only mechanism for
distributing the workload constituting the application, i.e., an idle process seeks to
steal work from another process. From a practical point of view, the application
starts with the process executing main (), which creates tasks. Typically, some of
these tasks are then stolen by idle processes, which are either local or on other
processors. Thus, the principal mechanism for dispatching tasks in the distributed
environment is task stealing
5. Fault and Fault Free Module
We add a checkpointing mechanism; it is of special interest to analyze its overhead
associated with fault-free execution, since the occurrence of faults is considered to
be the rare exception rather than the norm.
Hardware Requirements:
• System : Pentium IV 2.4 GHz.
• Hard Disk : 40 GB.
• Floppy Drive : 1.44 Mb.
• Monitor : 15 VGA Colour.
• Mouse : Logitech.
• Ram : 256 Mb.
Software Requirements:
• Operating system : - Windows XP Professional.
• Coding Language : - Java.
• Tool Used : - Eclipse.
Recommended