NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory http://www.cs.utk.edu/netsolve



Objectives

- Harnessing vast computational resources on the network
  - Hardware
  - Software
- Convenient for scientific computing community
  - Reducing installation and programming overhead
  - Masking complexity related to distributed computing

Computation-Sharing Models: Proxy Computing

[Diagram: client and server. The client holds the code and the data and sends both to the server.]

Computation on the server

Computation-Sharing Models: Code Shipping

[Diagram: client and server. The data stays on the client; the code is shipped from the server to the client.]

Computation on the client

Computation-Sharing Models: Remote Computation

[Diagram: client and server. The code resides on the server; the client sends its data to the server.]

Computation on the server

Design issues

- Platform independence to accommodate heterogeneity
- User friendly
- Extensibility
- Load balancing
- Fault tolerance

NetSolve Architecture

[Figure: architecture diagram. Surviving labels: “OS”, Resources.]


NetSolve Organization and Operation

NetSolve Client Interface

C, Fortran, Java, Matlab, and Mathematica

Blocking call (Matlab):

>> a = rand(100); b = rand(100,1);
>> x = netsolve('ax = b', a, b);

Non-blocking call (Matlab):

>> a = rand(100); b = rand(100,1);
>> request = netsolve_nb('send', 'ax = b', a, b);
>> x = netsolve_nb('probe', request);
Not ready
>> x = netsolve_nb('wait', request);
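The send / probe / wait pattern shown above can be mimicked locally. The following is a minimal sketch, not the NetSolve API: it uses Python's standard `concurrent.futures` thread pool in place of remote servers, and the names `send`, `probe`, `wait`, and `solve_2x2` are hypothetical helpers chosen only to mirror the three `netsolve_nb` actions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: a local thread pool stands in for the remote NetSolve servers.
_pool = ThreadPoolExecutor(max_workers=4)

def solve_2x2(a, b):
    """Hypothetical 'remote' problem: solve a 2x2 system with Cramer's rule."""
    time.sleep(0.2)  # simulate network and compute delay
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

def send(problem, *args):
    """Submit a request and return a handle immediately."""
    return _pool.submit(problem, *args)

def probe(request):
    """Report whether the result is ready, without blocking."""
    return request.done()

def wait(request):
    """Block until the result is available, then return it."""
    return request.result()

request = send(solve_2x2, [[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0])
ready_right_away = probe(request)  # usually False, since the task sleeps first
x = wait(request)                  # blocks until the solve has finished
```

Between `send` and `wait` the caller is free to submit further requests, which is how a non-blocking interface exposes parallelism to the client.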

NetSolve Wrappers

Problem description file for extensibility:

@PROBLEM ipars
@INCLUDE "ipars.h"
@LIB /home/user/lib/libipars.a
@DESCRIPTION
Parallel Sub-Surface Flow Simulator
@INPUT 2
@OBJECT STRING CHAR model
@OBJECT FILE CHAR infile

- Compiled into wrappers around scientific libraries
- XDR for platform-independent data transfer
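To illustrate why XDR gives platform-independent transfer: it fixes the byte order (big-endian) and the size of each item, so sender and receiver never depend on the native layout of either machine. The sketch below hand-encodes a variable-length array of doubles in the XDR wire layout using Python's `struct` module. It is an illustration of the format only and makes no claim about how NetSolve itself invokes XDR; the function names are hypothetical.

```python
import struct

def xdr_pack_double_array(values):
    """Encode a variable-length array of doubles in XDR layout:
    a 4-byte big-endian element count, then each element as an
    8-byte big-endian IEEE 754 double."""
    count = len(values)
    return struct.pack(">I", count) + struct.pack(f">{count}d", *values)

def xdr_unpack_double_array(payload):
    """Decode the layout produced by xdr_pack_double_array."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(f">{count}d", payload, 4))

encoded = xdr_pack_double_array([1.0, -2.5, 3.25])
decoded = xdr_unpack_double_array(encoded)
```

The explicit `>` in each format string is what makes the bytes identical on little-endian and big-endian hosts.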

NetSolve Load Balancing

- Assigning a task to the “best” machine
  - Establishing a performance model
    - Network delay, server properties, task properties
  - Measuring and monitoring dynamic system states
- Load balancing at a finer granularity
  - Parallelism through non-blocking interface
  - Task migration
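One way to picture "assigning a task to the best machine" is to estimate a completion time per server from the three ingredients named above (network delay, server properties, task properties) and pick the minimum. The formula and all field names below are assumptions made for illustration; the slides do not give the actual NetSolve performance model.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_s: float      # one-way network delay to the client
    bandwidth_bps: float  # bytes per second between client and server
    peak_flops: float     # nominal speed of the server
    load: float           # fraction of the CPU already busy, 0.0 <= load < 1.0

def estimated_time(server, data_bytes, task_flops):
    """Assumed model: round-trip delay + transfer time + compute time,
    with compute speed scaled by the idle fraction of the server."""
    transfer = 2 * server.latency_s + data_bytes / server.bandwidth_bps
    compute = task_flops / (server.peak_flops * (1.0 - server.load))
    return transfer + compute

def pick_best(servers, data_bytes, task_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, task_flops))

servers = [
    Server("fast-but-far", latency_s=0.2, bandwidth_bps=1e6, peak_flops=1e9, load=0.0),
    Server("slow-but-near", latency_s=0.001, bandwidth_bps=1e8, peak_flops=1e8, load=0.5),
]
small_task = pick_best(servers, data_bytes=8e6, task_flops=1e8)
large_task = pick_best(servers, data_bytes=8e6, task_flops=1e11)
```

With these made-up numbers the small task goes to the nearby server because transfer time dominates, while the large task goes to the faster server because compute time dominates. Since `load` changes over time, such a model is only useful if the dynamic state is measured and refreshed.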

NetSolve Fault Tolerance

- Inter-server fault tolerance
  - Fault tolerance among NetSolve servers
- Intra-server fault tolerance
  - Fault tolerance within a NetSolve server

NetSolve Fault Tolerance: Inter-server Fault Tolerance

- Performed by NetSolve agents
- Basic approach
  - Failure detection + task reallocation
  - Overload detection + task migration
- Introducing NetSolve storage servers
  - Store checkpoints or any information related to fault tolerance (must be platform-independent)
  - No reliance on failed or overloaded server for task migration
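The "failure detection + task reallocation" step can be sketched as a loop that hands the task to the next candidate server whenever the current one fails. Everything here is hypothetical (the exception type, the server callables, the function name); it models the idea only, and it restarts the task from scratch rather than resuming from a checkpoint held on a storage server.

```python
class ServerFailure(Exception):
    """Raised by a simulated server that cannot complete a task."""

def run_with_reallocation(task, servers):
    """Try each (name, execute) pair in order. On failure, record the
    server and reallocate the task to the next one.
    Returns (result, names of the servers that failed)."""
    failed = []
    for name, execute in servers:
        try:
            return execute(task), failed
        except ServerFailure:
            failed.append(name)  # failure detected: move on to the next server
    raise RuntimeError("no server could complete the task")

def crashed_server(task):
    raise ServerFailure("connection lost")

def healthy_server(task):
    return sum(task)

result, failed = run_with_reallocation(
    [1, 2, 3],
    [("server-a", crashed_server), ("server-b", healthy_server)],
)
```

Keeping checkpoints on a separate storage server would let the second server resume where the first stopped, without asking the failed machine for anything.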

NetSolve Fault Tolerance: Intra-server Fault Tolerance

- Not a new problem
- Could be invisible to NetSolve
- Can take advantage of platform-specific features for fault tolerance
- Possible integration with inter-server fault tolerance

Diskless Checkpointing: Checksums and Reverse Computation

- Diskless checkpointing eliminates the need for stable storage
- N servers + a checkpointing server
  - At any point, consistent checkpoints taken at N servers (stored in memory)
  - A checksum of checkpoints stored at the checkpointing server
  - Rollback using reverse computation
  - State recovery using the checksum
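The checksum-based recovery described above can be sketched with XOR parity: the checkpointing server keeps the XOR of all N in-memory checkpoints, and if one server is lost its checkpoint is rebuilt from the parity and the N - 1 survivors. XOR parity is an assumption made for this sketch; the slides do not say which checksum is used, and the reverse-computation rollback is not modeled here.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def checksum(checkpoints):
    """Parity of equal-length checkpoints, as held by the checkpointing server."""
    return reduce(xor_bytes, checkpoints)

def recover(parity, surviving_checkpoints):
    """Rebuild the single missing checkpoint from the parity and the survivors."""
    return reduce(xor_bytes, surviving_checkpoints, parity)

# Three servers, each holding an in-memory checkpoint of the same length.
checkpoints = [b"state-A0", b"state-B1", b"state-C2"]
parity = checksum(checkpoints)

# Suppose the second server fails and its memory is lost.
survivors = [checkpoints[0], checkpoints[2]]
rebuilt = recover(parity, survivors)
```

This scheme tolerates the loss of any one of the N servers per checksum; losing two at once would need additional redundancy.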

Applications

- MCell with NetSolve
  - Large code, small data
- Matlab with NetSolve
  - Tradeoffs between parallelism and overhead
- IPARS with NetSolve
- ImageVision with NetSolve


Integration with ScaLAPACK


Integration with Condor


Integration with Ninf

Conclusion

- An interesting infrastructure for sharing computational resources
  - Both software and hardware
- Convenience, performance, and reliability
- Playground for fault tolerance
  - Both general and specific