Fault Tolerant Grid Workflow in Water Threat Management Master’s project / thesis seminar Young Suk Moon Chair: Prof. Gregor von Laszewski Reader: Observer:


Fault Tolerant Grid Workflow in Water Threat Management

Master’s project / thesis seminar

Young Suk Moon

Chair: Prof. Gregor von Laszewski Reader: Observer:


Outline

Brief summary of Water Threat Management

Goal of the project

Background for my topic

Dynamic job scheduling

Fault tolerant grid systems

My ideas


Water Threat Management project

Analyzing contamination of water in urban water distribution systems

[Figure: system diagram with the components Sensor Data, Optimization Engine, Middleware, Grid Resources, and Simulation Engine (MPI) running four EPANET instances; labels: "find the contaminant source", "find the optimal solution"]


Goal of the project

Problems of the current WTM system (MPI)

Not fault tolerant

All computation must restart from the beginning if a node fails

Decision

Change the MPI system to a loosely coupled system


Problems to solve

Run-time job scheduling

Fault tolerance


Background: Dynamic resource selection

[Figure: jobs in a Job Queue are assigned to Machines; a Performance DB is consulted to select the machine for each job]
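The slide names only the parts (job queue, machines, performance DB, machine selection), not the selection rule. The sketch below illustrates one possible rule, assuming the performance DB stores a mean job runtime per machine and the scheduler picks the machine with the earliest estimated finish time. `Machine`, `avg_job_seconds`, `estimated_finish`, `select_machine`, and `schedule` are hypothetical names, not part of the WTM system.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    avg_job_seconds: float  # assumed Performance DB entry: mean runtime of past jobs
    queued_jobs: int = 0    # jobs already assigned to this machine


def estimated_finish(m: Machine) -> float:
    """Estimated seconds until a newly assigned job would finish on m."""
    return (m.queued_jobs + 1) * m.avg_job_seconds


def select_machine(machines):
    """Pick the machine with the earliest estimated finish time (assumed rule)."""
    return min(machines, key=estimated_finish)


def schedule(jobs, machines):
    """Assign each job at run time, updating queue lengths as we go."""
    assignment = {}
    for job in jobs:
        m = select_machine(machines)
        m.queued_jobs += 1
        assignment[job] = m.name
    return assignment


if __name__ == "__main__":
    pool = [Machine("A", 10.0), Machine("B", 17.0), Machine("C", 45.0)]
    print(schedule([1, 2, 3, 4, 5], pool))
```

Because the choice is re-evaluated for every job, a slow machine receives work only once the faster machines have built up a backlog.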


Background: Fault tolerance in grid

Replication

Run the same job on multiple nodes

Needs more resources

Checkpoint-restart

Checkpoint server

Slow due to checkpoint overhead


My ideas: Multiple-enqueue and Discard

[Figure: a Global Queue holding jobs 1-5; Machines A, B, and C, each with its own queue (queue A, queue B, queue C)]


My ideas: Multiple-enqueue and Discard

[Figure: the Global Queue now holds jobs 6-10; queues A, B, and C of Machines A, B, and C hold duplicated copies of jobs 1-5]
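The slides do not spell out the mechanics of multiple-enqueue and discard, so the following is a minimal sketch of one reading of the idea: each job from the global queue is placed on several machine queues, and a machine discards any queued copy of a job that another machine has already completed. The round-robin placement, the `copies` parameter, and the function names `multiple_enqueue` and `run` are assumptions made for illustration.

```python
from collections import deque


def multiple_enqueue(jobs, machine_names, copies=2):
    """Place each job on `copies` distinct machine queues (round-robin, assumed)."""
    n = len(machine_names)
    copies = min(copies, n)  # cannot use more distinct queues than machines
    queues = {name: deque() for name in machine_names}
    for i, job in enumerate(jobs):
        for c in range(copies):
            queues[machine_names[(i + c) % n]].append(job)
    return queues


def run(queues, failed=frozenset()):
    """Simulate machines taking turns; return {job: machine that completed it}.

    Machines listed in `failed` never process anything, which models a node
    failure before the run starts.
    """
    done = {}
    alive = [m for m in queues if m not in failed]
    progress = True
    while progress:
        progress = False
        for m in alive:
            q = queues[m]
            # Discard step: drop copies of jobs that were finished elsewhere.
            while q and q[0] in done:
                q.popleft()
            if q:
                done[q.popleft()] = m
                progress = True
    return done


if __name__ == "__main__":
    queues = multiple_enqueue([1, 2, 3, 4, 5], ["A", "B", "C"], copies=2)
    print(run(queues, failed={"B"}))
```

With two copies per job, the loss of one machine does not lose any job in this sketch, because each job still sits in another machine's queue. With a single copy, the jobs placed only on the failed machine are never completed. This is the trade-off behind the question of how many duplicate jobs to enqueue.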


Issues

How many duplicated jobs to enqueue

How to allocate which jobs to which machines

How to divide jobs or input data

How to cluster nodes


Evaluation

Comparison based on the different settings


References

G. von Laszewski, K. Mahinthakumar, R. Ranjithan, D. Brill, J. Uber, K. Harrison, S. Sreepathi, and E. Zechman, “An Adaptive Cyberinfrastructure for Threat Management in Urban Water Distribution Systems,” in Proceedings of ICCS 2006, vol. 3993, 2006, pp. 401–.

S. Sreepathi, “Cyberinfrastructure for Contamination Source Characterization in Water Distribution Systems,” Master’s thesis, North Carolina State University, 2006.

G. von Laszewski, “A Loosely Coupled Metacomputer: Cooperating Job Submissions Across Multiple Supercomputing Sites,” Concurrency: Practice and Experience, vol. 11, no. 5, pp. 933–948, Dec. 1999.

L. Ramakrishnan and D. A. Reed, “Performability Modeling for Scheduling and Fault Tolerance Strategies for Scientific Workflows,” in Proceedings of the 17th International Symposium on High Performance Distributed Computing, 2008.

S. Ayyub and D. Abramson, “GridRod - A Dynamic Runtime Scheduler for Grid Workflows,” in Proceedings of the 21st Annual International Conference on Supercomputing, 2007.