NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory

NetSolve

Henri Casanova and Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/netsolve

Objectives

Harnessing vast computational resources on the network
  Hardware
  Software
Convenient for scientific computing community
  Reducing installation and programming overhead
  Masking complexity related to distributed computing

Computation-Sharing Models: Proxy Computing

[Figure: client and server exchanging Data and Code; computation on the server]

Computation-Sharing Models: Code Shipping

[Figure: client and server, with Code and Data labels; computation on the client]

Computation-Sharing Models: Remote Computation

[Figure: client and server, with Data and Code labels; computation on the server]

Design issues

Platform independence to accommodate heterogeneity
User friendly
Extensibility
Load balancing
Fault tolerance

NetSolve Architecture

[Figure: NetSolve architecture diagram; surviving labels: "OS", Resources]

NetSolve Organization and Operation

NetSolve Client Interface

C, Fortran, Java, Matlab, and Mathematica

>> a = rand(100); b = rand(100,1);
>> x = netsolve('ax = b', a, b);

>> a = rand(100); b = rand(100,1);
>> request = netsolve_nb('send', 'ax = b', a, b);
>> x = netsolve_nb('probe', request);
Not ready
>> x = netsolve_nb('wait', request);
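The send / probe / wait pattern of the non-blocking interface can be illustrated outside NetSolve. The following is a minimal Python sketch, not the NetSolve client: a local thread pool stands in for the remote server, and the names `send`, `probe`, `wait`, and `solve_linear_system` are hypothetical helpers chosen to mirror the Matlab calls above.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x


# Assumption: a local thread pool plays the role of the remote servers.
_pool = ThreadPoolExecutor(max_workers=2)


def send(a, b):
    """Submit the problem and return immediately with a request handle."""
    return _pool.submit(solve_linear_system, a, b)


def probe(request):
    """Return True if the result is ready, without blocking."""
    return request.done()


def wait(request):
    """Block until the result is available and return it."""
    return request.result()


a = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
request = send(a, b)
# The client is free to do other work here, or to send further requests.
x = wait(request)
```

Sending several requests before waiting on any of them is what gives the client coarse-grained parallelism across servers.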

NetSolve Wrappers

Problem description file for extensibility

@PROBLEM ipars
@INCLUDE "ipars.h"
@LIB /home/user/lib/libipars.a
@DESCRIPTION
Parallel Sub-Surface Flow Simulator
@INPUT 2
@OBJECT STRING CHAR model
@OBJECT FILE CHAR infile

Compiled into wrappers around scientific libraries
XDR for platform-independent data transfer
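To show what platform-independent transfer means in practice, here is a small Python sketch of the XDR wire layout for a variable-length array of doubles: a 4-byte big-endian unsigned count followed by 8-byte big-endian IEEE doubles. It uses the standard `struct` module rather than any NetSolve code, and the function names are hypothetical.

```python
import struct


def encode_double_array(values):
    """Encode a list of floats as an XDR-style variable-length array."""
    header = struct.pack(">I", len(values))                # 4-byte element count
    body = struct.pack(">%dd" % len(values), *values)      # big-endian doubles
    return header + body


def decode_double_array(payload):
    """Decode bytes produced by encode_double_array on any platform."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(">%dd" % count, payload, 4))


wire = encode_double_array([1.0, -2.5, 3.25])
values = decode_double_array(wire)
```

Because the byte order is fixed by the format rather than by the host, a little-endian client and a big-endian server decode the same values.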

NetSolve Load Balancing

Assigning a task to the "best" machine
  Establishing a performance model
    Network delay, server properties, task properties
  Measuring and monitoring dynamic system states
Load balancing at a finer granularity
  Parallelism through non-blocking interface
  Task migration
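One way to picture a performance model built from network delay, server properties, and task properties is sketched below in Python. The formula (latency plus transfer time plus compute time scaled by current load) and every name in it are assumptions made for illustration; the slides do not give the model NetSolve actually uses.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    latency_s: float               # network delay to reach the server
    bandwidth_bytes_per_s: float   # network throughput to the server
    peak_flops: float              # server property: raw speed
    load: float                    # dynamic state: fraction of CPU busy, 0 <= load < 1


def estimated_time(server, data_bytes, work_flops):
    """Hypothetical model: time to ship the data plus time to compute."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bytes_per_s
    compute = work_flops / (server.peak_flops * (1.0 - server.load))
    return transfer + compute


def pick_best(servers, data_bytes, work_flops):
    """Assign the task to the server with the smallest estimated time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, work_flops))


servers = [
    Server("fast-but-far", 0.2, 1e6, 1e9, 0.0),
    Server("slow-but-near", 0.001, 1e8, 1e8, 0.0),
    Server("fast-but-busy", 0.001, 1e8, 1e9, 0.95),
]
best_large = pick_best(servers, data_bytes=8e6, work_flops=1e9)
best_small = pick_best(servers, data_bytes=8e3, work_flops=1e7)
```

The two calls show why task properties matter: the "best" machine changes with the ratio of data volume to computation.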

NetSolve Fault Tolerance

Inter-server fault tolerance
  Fault tolerance among NetSolve servers
Intra-server fault tolerance
  Fault tolerance within a NetSolve server

NetSolve Fault Tolerance: Inter-server Fault Tolerance

Performed by NetSolve agents
Basic approach
  Failure detection + task reallocation
  Overload detection + task migration
Introducing NetSolve storage servers
  Store checkpoints or any information related to fault tolerance (must be platform-independent)
  No reliance on failed or overloaded server for task migration
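The "failure detection + task reallocation" step can be sketched as a loop that tries candidate servers in order and moves on when one fails. This Python sketch is only an illustration: the `ServerFailure` exception, the server stubs, and `run_with_reallocation` are hypothetical, and a real agent would also handle overload and checkpoints.

```python
class ServerFailure(Exception):
    """Raised by a (hypothetical) server stub when the server is unreachable."""


def run_with_reallocation(task, servers):
    """Try each (name, execute) pair in order; reallocate the task on failure."""
    failed = []
    for name, execute in servers:
        try:
            return name, execute(task), failed
        except ServerFailure:
            failed.append(name)   # failure detected: fall through to the next server
    raise RuntimeError("all candidate servers failed: %s" % failed)


def broken_server(task):
    raise ServerFailure("connection lost")


def healthy_server(task):
    return sum(task)


served_by, result, failed = run_with_reallocation(
    [1, 2, 3], [("s1", broken_server), ("s2", healthy_server)]
)
```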

NetSolve Fault Tolerance: Intra-server Fault Tolerance

Not a new problem
Could be invisible to NetSolve
Can take advantage of platform-specific features for fault tolerance
Possible integration with inter-server fault tolerance

Diskless Checkpointing: Checksums and Reverse Computation

Diskless checkpointing eliminates the need for stable storage
N servers + a checkpointing server
  At any point, consistent checkpoints taken at N servers (stored in memory)
  A checksum of checkpoints stored at the checkpointing server
  Rollback using reverse computation
  State recovery using the checksum
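The checksum-based recovery above can be sketched with a bytewise XOR parity, a common choice for diskless checkpointing. The slides do not say which checksum is used, so XOR is an assumption here, as are the function names and the requirement that all checkpoints have equal length. Rollback by reverse computation is not modeled.

```python
from functools import reduce


def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))


def make_checksum(checkpoints):
    """Checksum held by the checkpointing server: XOR of all N checkpoints."""
    return reduce(xor_bytes, checkpoints)


def recover(checkpoints, checksum, lost_index):
    """Rebuild the lost checkpoint from the checksum and the N-1 survivors."""
    survivors = [c for i, c in enumerate(checkpoints) if i != lost_index]
    return reduce(xor_bytes, survivors, checksum)


# In-memory checkpoints of N = 3 servers (equal length by assumption).
checkpoints = [b"state-A0", b"state-B1", b"state-C2"]
checksum = make_checksum(checkpoints)
restored = recover(checkpoints, checksum, lost_index=1)
```

A single checksum of this kind tolerates the loss of one server at a time; losing two servers at once leaves their states unrecoverable.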

Applications

MCell with NetSolve
  Large code, small data
Matlab with NetSolve
  Tradeoffs between parallelism and overhead
IPARS with NetSolve
ImageVision with NetSolve

Integration with ScaLAPACK

Integration with Condor

Integration with Ninf

Conclusion

An interesting infrastructure for sharing computational resources
  Both software and hardware
Convenience, performance, and reliability
Playground for fault tolerance
  Both general and specific