SERC Research Seminar Day August 18, 2007 Predictions for Parallel Applications and Systems Sathish...

SERC Research Seminar Day, August 18, 2007

Predictions for Parallel Applications and Systems

Sathish Vadhiyar

Grid Applications Research Laboratory (GARL)

GARL Research

• Grid Applications

– Climate Modeling

– Gene Mutations

• Performance Modeling

• Rescheduling

• Others

– Prediction of queue wait times

Rescheduling

• The base is a parallel checkpointing library called SRS

• Checkpointing? – storing application’s state so as to continue from the previous state after interruption

• Interruption either by a scheduler or system faults

• SRS allows processor reconfiguration
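The idea of checkpointing with processor reconfiguration can be sketched as follows. This is a minimal illustration and not the SRS API: the functions `checkpoint` and `restart`, the JSON file format, and the block distribution are all assumptions made for this example. State held by one number of processes is stored, then redistributed over a different number of processes on restart.

```python
import json
import os
import tempfile

def checkpoint(chunks, path):
    """Gather the per-process chunks into one global array and store it.

    Hypothetical helper for illustration; not part of SRS.
    """
    global_data = [x for chunk in chunks for x in chunk]
    with open(path, "w") as f:
        json.dump({"data": global_data}, f)

def restart(path, nprocs):
    """Read the stored state and split it into nprocs nearly equal blocks."""
    with open(path) as f:
        data = json.load(f)["data"]
    base, extra = divmod(len(data), nprocs)
    chunks, start = [], 0
    for rank in range(nprocs):
        # The first `extra` ranks receive one additional element.
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks

def demo():
    # State held by 4 processes before the interruption.
    before = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
    path = os.path.join(tempfile.mkdtemp(), "app.ckpt")
    checkpoint(before, path)
    # Continue on 3 processes instead of 4.
    return restart(path, 3)
```

Calling `demo()` stores twelve values written by four processes and returns them as three blocks of four values each, which is the sense in which a restart can use a different processor configuration.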

[Figure: application progress across System 1, Storage, and System 2]

Optimal Checkpoint Interval

• Storing checkpoints periodically will help in fault-tolerance

• How periodic?

• What is the optimal checkpoint interval?

– More frequent checkpointing leads to increased checkpoint overhead

– Less frequent checkpointing leads to increased recovery time after failures
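The trade-off above can be made concrete with a first-order model. The sketch below uses Young's classical approximation, which is not taken from these slides; the parameter names `ckpt_cost` (time to write one checkpoint) and `mtbf` (mean time between failures) are assumptions for this example.

```python
import math

def expected_waste_fraction(interval, ckpt_cost, mtbf):
    """First-order fraction of time lost to checkpointing and rework.

    ckpt_cost / interval   : overhead of writing checkpoints
    interval / (2 * mtbf)  : on average half an interval is redone per failure
    """
    return ckpt_cost / interval + interval / (2.0 * mtbf)

def young_interval(ckpt_cost, mtbf):
    """Interval that minimises the first-order waste (Young's approximation)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)

if __name__ == "__main__":
    cost, mtbf = 60.0, 86400.0  # hypothetical: 1-minute checkpoint, 1-day MTBF
    best = young_interval(cost, mtbf)
    for t in (best / 4, best, best * 4):
        print(f"interval={t:9.1f}s waste={expected_waste_fraction(t, cost, mtbf):.4f}")
```

Under this model, intervals both shorter and longer than the computed one waste more time, which is the shape of the trade-off the two bullets describe.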

Illustration

Dynamic Determination of Optimal Checkpointing Intervals

• Start the application on a set of resources

• Predict the next failure on the set of resources

• Checkpoint “just before” the next failure

• The prediction has to be really accurate

• But no prediction can be 100% accurate

Probability Distribution of Failures

• Use a probability distribution of failures on the resources

• Need to know: The next time of failure with x% certainty

• But more certainty is also not good
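One way to read "the next time of failure with x% certainty" is as the longest horizon over which the resources survive with at least that probability. The sketch below assumes an exponential time-to-failure distribution, which the slides do not specify; the function name `safe_horizon` is hypothetical.

```python
import math

def safe_horizon(mtbf, certainty):
    """Largest t such that P(no failure before t) >= certainty.

    Assumes exponentially distributed time to failure with mean `mtbf`,
    so P(T > t) = exp(-t / mtbf).
    """
    if not 0.0 < certainty < 1.0:
        raise ValueError("certainty must be strictly between 0 and 1")
    return -mtbf * math.log(certainty)

if __name__ == "__main__":
    for c in (0.5, 0.9, 0.99):
        print(f"certainty={c:.2f} horizon={safe_horizon(1000.0, c):.1f}")
```

Under this assumption, demanding more certainty shrinks the horizon, so the application would checkpoint more often and pay more overhead. That is one possible reading of why more certainty is also not good.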

Markov Chains

For parallel M-M checkpointing

In SRS, there is almost no system down phase

For sequential applications

In SRS, transition from state 0 can lead to many states

Motivation for Queue Wait Times

• A Grid consisting of a number of batch queues

• A meta system that will:

– Predict the wait times and execution times of jobs

– Decide which queue is “most suitable” for the job
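The decision step of such a meta system can be sketched as below. The slides do not define "most suitable"; this example assumes it means the smallest predicted turnaround time (wait plus execution). The queue names and numbers are hypothetical.

```python
def most_suitable_queue(predictions):
    """Return the queue with the smallest predicted turnaround time.

    `predictions` maps a queue name to a pair
    (predicted wait time, predicted execution time), both in seconds.
    """
    if not predictions:
        raise ValueError("no queues to choose from")
    return min(predictions, key=lambda q: sum(predictions[q]))

# Hypothetical predictions for three batch queues.
queues = {
    "cluster_a": (3600.0, 1200.0),  # long wait, fast execution
    "cluster_b": (600.0, 2400.0),   # short wait, slow execution
    "cluster_c": (1800.0, 1500.0),
}

if __name__ == "__main__":
    print(most_suitable_queue(queues))
```

The quality of this choice depends entirely on the quality of the two predictions, which motivates asking what makes a predictor good.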

What is a good predictor?

• There are a number of prediction strategies

• Evaluating a predictor’s goodness:

1. Mean Absolute Percentage Error (MAPE)

2. Upper bound for actual/predicted

3. Average of (actual-predicted) [absolute error]

4. Absolute error/actual wait time [relative error]

5. Average error/average queue wait time

6. Coefficient of correlation

• Each of these metrics has flaws
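Several of the listed metrics can be computed as in the sketch below. The function names are chosen for this example, and metric 3 is interpreted here as the mean of |actual - predicted|, which is an assumption about the slide's shorthand.

```python
import math

def mape(actual, predicted):
    """Metric 1: mean absolute percentage error (actual values must be nonzero)."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Metric 3, read as the average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def normalized_error(actual, predicted):
    """Metric 5: average error divided by average queue wait time."""
    return mean_absolute_error(actual, predicted) / (sum(actual) / len(actual))

def correlation(actual, predicted):
    """Metric 6: Pearson coefficient of correlation."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

if __name__ == "__main__":
    actual = [10.0, 20.0, 40.0]     # hypothetical wait times
    predicted = [12.0, 18.0, 30.0]  # hypothetical predictions
    print(mape(actual, predicted), mean_absolute_error(actual, predicted))
    print(normalized_error(actual, predicted), correlation(actual, predicted))
```

As one illustration of the kind of flaw meant here: a job that waited 1 second but was predicted to wait 10 seconds contributes 900% to MAPE, the same as a job that waited 1 hour but was predicted to wait 10 hours, although only the second error matters in practice.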

Illustration

Method 1 Method 2

Metric 3 value of Method 1 < Metric 3 value of Method 2

i.e. Method 1 is better

Our goals

• To define useful metrics that can clearly say whether a method is “good” or “bad”

• Goodness of predictors

– In terms of absolute wait times

– In terms of execution times

– In terms of resource demand

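Illustration: Prediction errors versus absolute wait times

[Figure: percentage prediction error, (A-P)/A %, plotted against wait times, with marked points (x1, y1) and (x2, y2) and the annotation “Reality??”]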

What we want to do…

• Define metrics that can evaluate a method in the “absolute” sense, not “comparative” sense

– Stare at a single graph and ask “Is this graph good?” as much as possible

• In some cases, it may just not be possible

– Use comparisons

• Evaluate the existing methods on these sets of metrics

• Come up with a method that performs the best in terms of all of the defined metrics

Motivation

• Certain large computational phases of climate modeling (CCSM) are done only by some processors

• Load balancing – offload work from these processors to other processors

– Increased processor utilization

– Decreased execution time

• How much offloading?

– Need to predict workload based on previous computations

What is happening…

[Figure: Phase 1 and Phase 2 on Proc 0 to Proc 4]

What should happen…

[Figure: Phase 1 and Phase 2 on Proc 0 to Proc 4]

For this, we need to know the workload in phase 1

We predict the workload based on previous time steps
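A minimal sketch of this prediction step follows. The slides do not say which predictor is used; a moving average over recent time steps is assumed here purely for illustration, and the function names, the window size, and the equal-share offloading rule are all hypothetical. Communication cost is ignored.

```python
def predict_workload(history, window=3):
    """Predict the next time step's workload as the mean of the last `window` steps."""
    if not history:
        raise ValueError("need at least one previous time step")
    recent = history[-window:]
    return sum(recent) / len(recent)

def work_to_offload(predicted, busy_procs, total_procs):
    """Work to move off the busy processors so that every processor
    receives an equal share of the predicted workload."""
    if not 0 < busy_procs <= total_procs:
        raise ValueError("need 0 < busy_procs <= total_procs")
    return predicted * (total_procs - busy_procs) / total_procs

if __name__ == "__main__":
    history = [100.0, 110.0, 120.0, 130.0]  # hypothetical phase 1 workloads
    predicted = predict_workload(history)
    # Phase 1 runs on 2 of 5 processors in this hypothetical setup.
    print(predicted, work_to_offload(predicted, busy_procs=2, total_procs=5))
```

With these numbers the predicted workload is the mean of the last three steps, and three fifths of it would be offloaded to the three otherwise idle processors.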

Advantages

GARLians

• Yadnyesh Joshi (M.Sc)

• Karthikeyan Raman (M.Tech, jointly with Prof. Govindarajan)

• H.A. Sanjay (Ph.D, jointly with Prof. Ravi Nanjundiah, CAOS)

• Sivagama Sundari (Ph.D)

• Ashish Srivatsava (Project Assistant)

• Alumni

– 1 student intern from INSA, Lyon, France

– Summer interns

– Project assistants

– 2 M.Scs

Questions ????

http://garl.serc.iisc.ernet.in
