NetSolve
Henri Casanova and Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory
http://www.cs.utk.edu/netsolve
Objectives
Harnessing vast computational resources on the network
  Hardware
  Software
Convenient for the scientific computing community
  Reducing installation and programming overhead
  Masking complexity related to distributed computing
Computation-Sharing Models: Proxy Computing
[Figure: client and server. Both code and data are sent from the client to the server. Computation on the server.]
Computation-Sharing Models: Code Shipping
[Figure: client and server. Data stays on the client; code is shipped from the server to the client. Computation on the client.]
Computation-Sharing Models: Remote Computation
[Figure: client and server. Data is sent from the client to the server, where the code resides. Computation on the server.]
Design issues
Platform independence to accommodate heterogeneity
User friendly
Extensibility
Load balancing
Fault tolerance
NetSolve Architecture
[Figure: NetSolve architecture; surviving labels: “OS”, Resources]
NetSolve Organization and Operation
NetSolve Client Interface
C, Fortran, Java, Matlab, and Mathematica

>> a = rand(100); b = rand(100,1);
>> x = netsolve('ax = b', a, b);

>> a = rand(100); b = rand(100,1);
>> request = netsolve_nb('send', 'ax = b', a, b);
>> x = netsolve_nb('probe', request);
Not ready
>> x = netsolve_nb('wait', request);
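
The send / probe / wait pattern of the non-blocking calls above can be illustrated outside NetSolve. The following Python sketch is only an analogy: a local thread pool stands in for remote servers, and the names `send`, `probe`, `wait`, and `solve_2x2` are hypothetical helpers, not part of any NetSolve interface.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# A local thread pool stands in for remote servers (assumption for illustration).
_pool = ThreadPoolExecutor(max_workers=4)

def solve_2x2(a, b):
    """Solve the 2x2 system a x = b by Cramer's rule (stand-in for a remote solver)."""
    time.sleep(0.2)  # simulate network and compute latency
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]

def send(func, *args):
    """Submit the request and return immediately with a handle."""
    return _pool.submit(func, *args)

def probe(request):
    """Non-blocking check: True once the result is available."""
    return request.done()

def wait(request):
    """Block until the result is available and return it."""
    return request.result()

a = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
request = send(solve_2x2, a, b)
ready_now = probe(request)  # usually False right after sending
x = wait(request)           # blocks until the solver finishes
```

Between `send` and `wait` the client is free to issue further requests, which is how a non-blocking interface exposes parallelism across servers.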
NetSolve Wrappers
Problem description file for extensibility
@PROBLEM ipars
@INCLUDE "ipars.h"
@LIB /home/user/lib/libipars.a
@DESCRIPTION
Parallel Sub-Surface Flow Simulator
@INPUT 2
@OBJECT STRING CHAR model
@OBJECT FILE CHAR infile
Compiled into wrappers around scientific libraries
XDR for platform-independent data transfer
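
To make the idea of platform-independent transfer concrete, here is a small Python sketch of XDR-style encoding: every value is written big-endian, arrays and strings carry a 4-byte length prefix, and strings are padded to a multiple of 4 bytes. The function names are hypothetical, and this is not NetSolve's actual wire format, only an illustration of the encoding rules XDR defines for these types.

```python
import struct

def xdr_pack_double_array(values):
    """Encode a variable-length array of doubles: 4-byte count, then 8-byte big-endian doubles."""
    return struct.pack(">I", len(values)) + struct.pack(f">{len(values)}d", *values)

def xdr_unpack_double_array(buf):
    """Decode an array produced by xdr_pack_double_array."""
    (count,) = struct.unpack_from(">I", buf, 0)
    return list(struct.unpack_from(f">{count}d", buf, 4))

def xdr_pack_string(text):
    """Encode a string: 4-byte length, the bytes, then zero padding to a 4-byte boundary."""
    data = text.encode("ascii")
    padding = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding

wire = xdr_pack_double_array([1.0, -2.5])
decoded = xdr_unpack_double_array(wire)
model_field = xdr_pack_string("model")
```

Because the byte order is fixed on the wire, a little-endian client and a big-endian server decode the same values without knowing anything about each other.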
NetSolve Load Balancing
Assigning a task to the “best” machine
  Establishing a performance model
    Network delay, server properties, task properties
  Measuring and monitoring dynamic system states
Load balancing at a finer granularity
  Parallelism through non-blocking interface
  Task migration
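
A performance model of the kind listed above can be sketched in a few lines of Python. The formula here is an assumption made for illustration, not the model NetSolve uses: estimated time is network latency plus transfer time plus compute time, with the server's peak speed divided by (1 + load). All names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_s: float               # network delay to the server
    bandwidth_bytes_per_s: float   # network throughput to the server
    peak_flops: float              # static server property
    load: float                    # dynamic state, e.g. a load average

def estimated_time(server, data_bytes, task_flops):
    """Assumed model: transfer time plus compute time at load-adjusted speed."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bytes_per_s
    effective_flops = server.peak_flops / (1.0 + server.load)
    return transfer + task_flops / effective_flops

def pick_best(servers, data_bytes, task_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, task_flops))

servers = [
    Server("fast_busy", latency_s=0.01, bandwidth_bytes_per_s=1e7, peak_flops=1e9, load=9.0),
    Server("slow_idle", latency_s=0.05, bandwidth_bytes_per_s=1e6, peak_flops=2e8, load=0.0),
]

heavy_choice = pick_best(servers, data_bytes=8e5, task_flops=1e9)
light_choice = pick_best(servers, data_bytes=8e5, task_flops=1e7)
```

With these numbers the compute-heavy task goes to the idle machine despite its slower network, while the light task goes to the busy machine with the faster link, which is why both task properties and dynamic state belong in the model.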
NetSolve Fault Tolerance
Inter-server fault tolerance
  Fault tolerance among NetSolve servers
Intra-server fault tolerance
  Fault tolerance within a NetSolve server
NetSolve Fault Tolerance: Inter-server Fault Tolerance
Performed by NetSolve agents
Basic approach
  Failure detection + task reallocation
  Overload detection + task migration
Introducing NetSolve storage servers
  Store checkpoints or any information related to fault tolerance (must be platform-independent)
  No reliance on failed or overloaded server for task migration
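
The "failure detection + task reallocation" step can be sketched as a loop run by an agent. This Python sketch assumes that a failure surfaces as an exception and that servers are tried in a fixed order; `ServerFailure`, `run_with_reallocation`, and the two example servers are hypothetical and do not reflect NetSolve's actual agent protocol.

```python
class ServerFailure(Exception):
    """Raised when a server is detected as failed (assumed detection mechanism)."""

def run_with_reallocation(task, servers):
    """Try each (name, execute) pair in order; reallocate the task after a failure.

    Returns (server name, result, names of servers that failed first).
    """
    failed = []
    for name, execute in servers:
        try:
            return name, execute(task), failed
        except ServerFailure:
            failed.append(name)  # detected failure: move on to the next server
    raise RuntimeError("all servers failed")

def crashed_server(task):
    raise ServerFailure("no response")

def healthy_server(task):
    return sum(task)

used, result, failed = run_with_reallocation(
    [1, 2, 3], [("s1", crashed_server), ("s2", healthy_server)]
)
```

The sketch restarts the task from scratch on the next server; with checkpoints kept on a separate storage server, the second server could instead resume from the last saved state without contacting the failed one.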
NetSolve Fault Tolerance: Intra-server Fault Tolerance
Not a new problem
Could be invisible to NetSolve
Can take advantage of platform-specific features for fault tolerance
Possible integration with inter-server fault tolerance
Diskless Checkpointing: Checksums and Reverse Computation
Diskless checkpointing eliminates the need for stable storage
N servers + a checkpointing server
  At any point, consistent checkpoints taken at N servers (stored in memory)
  A checksum of checkpoints stored at the checkpointing server
  Rollback using reverse computation
  State recovery using the checksum
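
State recovery from a checksum can be sketched as follows. The slides do not say which checksum is used; this Python sketch assumes a bytewise XOR parity over equal-length checkpoints, one common choice for diskless checkpointing. Reverse computation is not shown, and all names are hypothetical.

```python
from functools import reduce

def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))

def checksum(checkpoints):
    """Parity held by the checkpointing server: XOR of all N in-memory checkpoints."""
    return reduce(xor_bytes, checkpoints)

def recover(checkpoints, parity, lost_index):
    """Rebuild the checkpoint of the failed server from the parity and the survivors."""
    survivors = [c for i, c in enumerate(checkpoints) if i != lost_index]
    return reduce(xor_bytes, survivors, parity)

# Three servers, each holding its own checkpoint in memory (equal lengths assumed).
checkpoints = [b"state-A0", b"state-B1", b"state-C2"]
parity = checksum(checkpoints)

# Server 1 fails and loses its memory; its state is rebuilt from the others.
recovered = recover(checkpoints, parity, lost_index=1)
```

A single parity tolerates the loss of one server at a time; surviving more simultaneous failures would need a stronger code than plain XOR.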
Applications
MCell with NetSolve
  Large code, small data
Matlab with NetSolve
  Tradeoffs between parallelism and overhead
IPARS with NetSolve
ImageVision with NetSolve
Integration with ScaLAPACK
Integration with Condor
Integration with Ninf
Conclusion
An interesting infrastructure for sharing computational resources
  Both software and hardware
Convenience, performance, and reliability
Playground for fault tolerance
  Both general and specific