Fault Tolerant Grid Workflow in Water Threat Management Master’s project / thesis seminar


Young Suk Moon

Chair: Prof. Gregor von Laszewski

Reader:

Observer:

Outline

Brief summary of Water Threat Management

Goal of the project

Background for my topic

Dynamic job scheduling

Fault tolerant grid systems

My ideas

Water Threat Management project

Analyzing contamination of water in urban water distribution systems

[Figure: WTM system architecture. Components: Sensor Data, Optimization Engine, Middleware, Simulation Engine (MPI), four EPANET instances, Grid Resources. Labels: "find the contaminant source", "find the optimal solution".]

Goal of the project

Problems of the current WTM system (MPI)

Not fault tolerant

All computation must restart from the beginning in case of node failure

Decision

Change the MPI system to a loosely coupled system

Problems to solve

Run-time job scheduling

Fault tolerance

Background: Dynamic resource selection

Job Queue

Machines

Jobs

Performance DB

Select machine
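The selection step in the diagram can be illustrated with a minimal Python sketch. It assumes the Performance DB stores one average runtime per machine and that the scheduler picks the machine with the earliest estimated finish time; the names `Machine`, `avg_runtime`, `select_machine`, and `schedule` are hypothetical, and the runtimes are made-up example values, not measurements from the WTM system.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    avg_runtime: float  # assumed Performance DB entry: mean seconds per job
    queue: list = field(default_factory=list)

    def estimated_finish(self) -> float:
        # Time until this machine would finish one additional job.
        return (len(self.queue) + 1) * self.avg_runtime


def select_machine(machines):
    """Pick the machine with the earliest estimated finish time."""
    return min(machines, key=lambda m: m.estimated_finish())


def schedule(jobs, machines):
    """Assign each job from the job queue to the currently best machine."""
    for job in jobs:
        select_machine(machines).queue.append(job)
    return {m.name: list(m.queue) for m in machines}


# Illustrative values only.
machines = [Machine("A", 2.0), Machine("B", 3.5), Machine("C", 5.0)]
assignment = schedule([1, 2, 3, 4, 5], machines)
print(assignment)
```

Because the estimate is recomputed for every job, faster machines receive more jobs while slower ones are still used once the fast queues grow.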

Background: Fault tolerance in grid

Replication

Run the same job on multiple nodes

Needs more resources

Checkpoint-restart

Checkpoint server

Slow due to checkpoint overhead
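Checkpoint-restart can be sketched as follows. This is a simplified illustration, not the mechanism of any particular grid system: a local JSON file stands in for the checkpoint server, the job is a trivial running sum, and `fail_at` is a hypothetical parameter that simulates a node failure.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, fail_at=None):
    """Sum 0..total_steps-1, saving progress after every step."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last checkpoint

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # this write is the checkpoint overhead
    return state["acc"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_with_checkpoints(10, path, fail_at=6)
except RuntimeError:
    pass  # the first attempt dies at step 6
result = run_with_checkpoints(10, path)  # the restart continues from step 6
print(result)
```

The restarted run skips the six completed steps, but every step pays for a write to the checkpoint store, which is the overhead noted above.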

My ideas: Multiple-enqueue and Discard

[Figure: a Global Queue holding jobs 1 to 5, and Machines A, B, and C, each with its own local queue (queue A, queue B, queue C).]

My ideas: Multiple-enqueue and Discard

[Figure: the Global Queue now holds jobs 6 to 10; jobs 1 to 5 appear, some more than once, in the local queues of Machines A, B, and C (queue A, queue B, queue C).]
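One possible reading of multiple-enqueue and discard is sketched below in Python. The slides do not specify the policy, so everything here is an assumption: each job is copied into `copies` local queues chosen round-robin, machines take turns running the job at the head of their queue, and a queued duplicate is discarded once another machine has finished that job. The function names and the `failed` parameter are hypothetical.

```python
from collections import deque


def multiple_enqueue(jobs, machine_names, copies=2):
    """Place each job in `copies` local queues, chosen round-robin."""
    queues = {name: deque() for name in machine_names}
    n = len(machine_names)
    for i, job in enumerate(jobs):
        for c in range(copies):
            queues[machine_names[(i + c) % n]].append(job)
    return queues


def run(queues, failed=()):
    """Simulate machines taking turns; finished jobs are discarded elsewhere."""
    done = {}  # job -> machine that finished it
    alive = [m for m in queues if m not in failed]
    while any(queues[m] for m in alive):
        for m in alive:
            # Discard: drop queued duplicates that another machine already finished.
            while queues[m] and queues[m][0] in done:
                queues[m].popleft()
            if queues[m]:
                done[queues[m].popleft()] = m
    return done


queues = multiple_enqueue([1, 2, 3, 4, 5], ["A", "B", "C"], copies=2)
done = run(queues, failed={"B"})  # Machine B never runs anything
print(done)
```

In this sketch every job still completes even though Machine B fails, because each job has a second copy on a live machine. With `copies=1`, the jobs queued only on B would be lost, which is why the number of duplicates is one of the open issues.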

Issues

How many duplicated jobs to enqueue

How to allocate which jobs to which machines

How to divide jobs or input data

How to cluster nodes

Evaluation

Comparison based on the different settings

