Systems Seminar Schedule
1 October - Douglas Thain
– “Error Management in a Virtual Operating System”
15 October - Andrea Arpaci-Dusseau
– “Information and Control in Gray-Box Systems”
29 October - John Bent
– “Creating Communities for Grid I/O”
12 November - Open
26 November - Open
10 December - Open
Error Management in a Virtual Operating System
Douglas Thain
Condor Project
University of Wisconsin
What is a Virtual OS?
[Diagram: layered stack. Hardware; Operating System with Device Drivers; Virtual OS 1 with its own Device Drivers, hosting App 1; Virtual OS 2 with its own Device Drivers, hosting App 2; App 3; App 4.]
Why Use a Virtual OS?
To test and deploy software that would otherwise require destructive changes. (Wine, User Mode Linux)
To improve indirection or fault-tolerance. (Rocks, Socks, Grid Console)
To transparently harness exterior resources. (UFO, Condor, PFS)
Harness the Grid
[Diagram: Virtual OS 1 and Virtual OS 2 with App 1, App 2, App 3, and App 4.]
In a Standard OS, Errors are not Difficult
Layers are members of a unified engineering effort.
A standard namespace and scheme are used end-to-end.
Most interfaces closely resemble the underlying implementation.
Most catastrophic failures are coordinated.
[Diagram: Device Driver, File System, OS Kernel, Standard Library, App, with errno passed between each pair of layers.]
Handling Errors is a Serious Problem On the Grid
It is an important problem to solve:
– As systems grow more complex, MTBF->0.
– Failures are generally uncoordinated.
– Propagating knowledge of failure is more important than increasing likelihood of success.
It is a difficult problem to solve:
– Theoretical: Matching different abstractions.
– Technical: Mating different languages and conventions.
– Social: Coordinating distinct engineering efforts.
Error Management: A Problem of Depth
[Diagram: a deep stack of components: App, Virtual OS, FTP Driver, Globus FTP Library (shown twice), FTP Server, Unitree OS, Disk Cache, Tape Archive. Interface labels between them: POSIX, Unitree, Globus FTP, Globus, DDI.]
A Problem of Width
[Diagram: App above the Virtual Operating System, which returns errno and sits on many drivers: UNIX Driver, SRB Driver, FTP Driver, NeST Driver, Kangaroo Driver, Globus GASS Driver.]
An Alphabet Soup of Protocols, APIs, Systems, Authorities, and Authors
A Problem of Design Direction
[Diagram, two stacks.
Bottom Up Design: App, Application Library, Standard Library, OS Kernel, with errno passed between layers.
Outside In Design: App, Virtual OS, FTP Driver, FTP Library, Globus, DDI, with errno at one end and “???” marking the unknown error interface.]
How do we correctly represent errors in a virtual operating system?
Spirit of this Talk
Software design involves striking balances -- there is no trivial answer.
Concentrate on presenting several concrete problems and working solutions.
Given these “data points,” I will present some reasonable generalizations.
Languages and conventions are ancillary issues.
– e.g. Exceptions vs. signals vs. errnos
Discussion and disagreement are welcome!
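The Pluggable File System
[Diagram: App and Bypass above a set of drivers, each paired with a library where one is shown: Local Driver; SRB Driver and SRB Library; Kangaroo Driver and Kangaroo Library; GridFTP Driver and GridFTP Library; NeST Driver and NeST Library; HTTP Driver and HTTP Library. Below them: Grid Services and the Host Operating System.]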
Examples of PFS
% vi /gsiftp/vulture.cs.wisc.edu/etc/hosts
% grep phone /http/www.cs.wisc.edu/
% gcc /nest/turkey.cs.wisc.edu/input.c -o /kangaroo/khaki.ncsa.uiuc.edu/output
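The examples above suggest that the first path component selects a driver and the second names a remote host. The following is a minimal sketch of that dispatch, not the PFS implementation: the `DRIVERS` table, the `Route` type, and the pairing of the `gsiftp` prefix with the GridFTP driver are assumptions made for illustration.

```python
from typing import NamedTuple

# Hypothetical prefix table. The prefixes come from the example commands;
# the driver names and the pairing are assumed for this sketch.
DRIVERS = {
    "gsiftp": "GridFTPDriver",
    "http": "HTTPDriver",
    "nest": "NeSTDriver",
    "kangaroo": "KangarooDriver",
    "srb": "SRBDriver",
}


class Route(NamedTuple):
    driver: str  # which driver should service the name
    host: str    # remote host, empty for local names
    path: str    # path as seen by the selected driver


def route(name: str) -> Route:
    """Split /<prefix>/<host>/<rest> into a driver, a host, and a path."""
    parts = name.split("/", 3)
    if len(parts) >= 3 and parts[0] == "" and parts[1] in DRIVERS:
        rest = "/" + parts[3] if len(parts) == 4 else "/"
        return Route(DRIVERS[parts[1]], parts[2], rest)
    # Anything without a known prefix is treated as an ordinary local name.
    return Route("LocalDriver", "", name)
```

With this sketch, `route("/gsiftp/vulture.cs.wisc.edu/etc/hosts")` selects the assumed GridFTP driver for host `vulture.cs.wisc.edu` and path `/etc/hosts`, while `route("/etc/hosts")` falls through to the local driver.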
The Pluggable File System
A Kernel on Top of a Kernel
[Diagram: PFS internal structures above the Host Operating System: a table of File Descriptors (slots 0 to 12), File Pointers, File Objects, the Current Working Directory, the Mount Table, and namei, dispatching to the Local, SRB, Kangaroo, GridFTP, NeST, and HTTP drivers. Values shown: 65, 1001, 0, 150, 126. Names shown: /tmp/input, /gsiftp/host/out.10, /srb/host, /tmp/data, /kangaroo/host, /etc/hosts.]
Not a Complete Virtual OS
Does not address process management, synchronization, etc...
Complete enough to be put to good use with real, non-trivial applications.
– Gaussian - atomic model simulation
– CMSIM - simulation of CERN LHC
– POVray - ray tracing software
Structure and concept are developed enough to explore other OS issues… others welcome!
Top-Level Error Space
A single namespace of integer errors that apply to all levels of the system.
Any call is free to return any possible error. (124)
General vs specific:
– ENOENT vs ECHILD
Some artifacts:
– EACCES vs EPERM
– EADV and EDOTDOT
EPERM 1 /* Operation not permitted */
ENOENT 2 /* No such file or directory */
ESRCH 3 /* No such process */
EINTR 4 /* Interrupted system call */
EIO 5 /* I/O error */
ENXIO 6 /* No such device or address */
E2BIG 7 /* Arg list too long */
ENOEXEC 8 /* Exec format error */
EBADF 9 /* Bad file number */
ECHILD 10 /* No child processes */
EAGAIN 11 /* Try again */
ENOMEM 12 /* Out of memory */
EACCES 13 /* Permission denied */
..
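As an illustration of a single integer namespace, the sketch below uses Python's standard `errno` and `os` modules, which expose the host's error table. The helper names `name_of` and `fail` are hypothetical; the point is only that any layer can raise any code from the same table.

```python
import errno
import os


def name_of(code: int) -> str:
    """Return the symbolic name for an integer error, or UNKNOWN."""
    return errno.errorcode.get(code, "UNKNOWN")


def fail(code: int) -> None:
    """Raise the given error; nothing ties a code to a particular layer."""
    raise OSError(code, os.strerror(code))


if __name__ == "__main__":
    # Print a few entries of the shared table: number, name, message.
    for code in (errno.EPERM, errno.ENOENT, errno.ECHILD, errno.EACCES):
        print(code, name_of(code), os.strerror(code))
```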
Concrete Problems and Solutions
Too little information - file transfer replies (FTP)
– Stick your head in the sand.
– Grope in the dark.
– Never forget a face.
Too much information - infinite namespace (SRB)
– Divide and conquer.
– Appeal to a higher power.
New failure modes - login errors (Globus)
– Take it easy.
– Split hairs.
The Problem of Too Little Information
Too Little Information: FTP Replies
Integer codes indicate the severity of a response to an action.
Many transfer problems are identified, but few file system problems are.
Third digit specified infrequently, and for wide classes of errors.
First digit:
100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative
Second digit:
000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System
550: “e.g. File not found, no access”
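To make the digit structure concrete, here is a small sketch that splits a reply into the severity and category listed above. The function name `classify` and the tuple it returns are assumptions for illustration; the tables simply restate the slide.

```python
# First digit: severity of the reply.
SEVERITY = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

# Second digit: broad category of the reply.
CATEGORY = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def classify(reply: str):
    """Return (severity, category, third digit) for a reply such as '550: ...'."""
    code = int(reply[:3])
    severity = SEVERITY.get(code // 100, "Unknown")
    category = CATEGORY.get(code // 10 % 10, "Unknown")
    return severity, category, code % 10
```

A 550 reply classifies as a permanent negative file system error, which is exactly why it is ambiguous: the code alone cannot distinguish a missing file from a permission problem.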
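Too Little Information: FTP Replies
[Diagram: the App issues “open datafile” to the Virtual OS, which passes “open datafile” to the FTP Driver, which sends “GET datafile” to the FTP Server. The server replies “550: Pas de tellementlime ou repertoire...”, and the driver must decide what to return for “open datafile”: ENOENT, EACCES, EISDIR...?]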
Too Little Information: “Stick your head in the sand”
If you don’t understand the failure, keep trying until the result is acceptable.
Might work for transient errors.
Might even work for the savvy user that can identify and fix problems.
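A minimal sketch of this strategy follows, assuming the failing operation is wrapped in a callable that raises `OSError`. The function name `retry` and its `attempts` and `delay` parameters are hypothetical; bounding the attempts is an addition here so that the loop cannot spin forever on a permanent error.

```python
import time


def retry(operation, attempts=5, delay=0.0):
    """Call operation() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as exc:
            # The failure is not understood; just remember it and try again.
            last_error = exc
            time.sleep(delay)
    # Every attempt failed: surface the last error unchanged.
    raise last_error
```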
Too Little Information: “Grope in the Dark”
if GET succeeds
    return success
else
    if CHDIR succeeds
        return EISDIR
    else
        if LIST succeeds
            return EACCES
        else
            return ENOENT
        end
    end
end
[Diagram: the driver probes the server with GET, CHDIR, and LIST in turn and arrives at EACCES.]
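The probing sequence can be written as a short function. This is a sketch under assumptions: `get`, `chdir`, and `list_dir` are hypothetical callables that issue the corresponding FTP command for a name and return True on success.

```python
import errno


def grope(get, chdir, list_dir, name):
    """Infer an errno for a failed transfer by probing with further commands.

    Returns 0 if GET succeeds, otherwise the inferred error code.
    """
    if get(name):
        return 0
    if chdir(name):
        # We can enter it, so the name is a directory rather than a file.
        return errno.EISDIR
    if list_dir(name):
        # The name can be listed but not fetched: treat as permission denied.
        return errno.EACCES
    # Nothing works on this name: assume it does not exist.
    return errno.ENOENT
```

Each probe is a separate round trip, so the answer can be stale by the time it is returned if the server state changes between commands.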
Too Little Information: “Never Forget a Face”
Each error condition has a signature:
– Server identifier: “wuftpd 4.1 ftp.cs”
– Operation attempted: “GET”
– Message in reply: “550: Pas de tallenmand...”
First “Grope” and then cache the determined error along with the signature.
Problems:
– Server must be consistent
– Groping is not atomic
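One way to picture the cache is a dictionary keyed by the three-part signature. The class name `ErrorCache` and the `grope` callback are hypothetical; the callback stands in for whatever probing produced the error the first time.

```python
class ErrorCache:
    """Remember the error determined for each (server, operation, message)."""

    def __init__(self):
        self._known = {}

    def lookup(self, server_id, operation, message, grope):
        """Return the cached error for this signature, probing only once."""
        signature = (server_id, operation, message)
        if signature not in self._known:
            # First sighting of this signature: pay for the expensive probe.
            self._known[signature] = grope()
        return self._known[signature]
```

The cache is only as good as the server's consistency: if the same message can mean different things on different occasions, the remembered answer will be wrong.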
The Problem of Too Much Information
Too Much Info: SRB Replies
Multiplexes many server backends into one client interface.
Error space is an amalgam of all back end error spaces.
Any call may return any error.
1026 and growing!
UNIX_EPERM -1301
UNIX_ENOENT -1302
. . .
UNIX_EDEADLOCK -1356
HPSS_EPERM -1401
HPSS_ENOENT -1402
. . .
HPSS_NOCOS -1499
MCAT_OPEN_ERROR -3001
MCAT_CONNECT_ERROR -3002
. . .
MCAT_USER_NOT_IN_DOMN -3032
SQL_RSLT_TOO_LONG -1600
HTTP_ERR_BAD_PATH -1700
Too Much Information: “Divide and Conquer”
[Diagram: the large SRB error space is divided among a small set of errors: EPERM, ENOENT, ESRCH, EINTR, EIO, EACCES, EISDIR, and OTHER for everything else.]
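A table-driven conversion is one plausible way to divide the space. The sketch below uses the four SRB codes from the reply list whose meaning is evident from their names; the table itself, the `OTHER` sentinel value, and the function name `convert` are assumptions for illustration.

```python
import errno

# Hypothetical sentinel for codes that fit no existing errno.
OTHER = -1

# Partial, assumed table: SRB code -> errno. Only codes whose names make
# the intended meaning evident are listed.
SRB_TO_ERRNO = {
    -1301: errno.EPERM,   # UNIX_EPERM
    -1302: errno.ENOENT,  # UNIX_ENOENT
    -1401: errno.EPERM,   # HPSS_EPERM
    -1402: errno.ENOENT,  # HPSS_ENOENT
}


def convert(srb_code: int) -> int:
    """Map an SRB error code to an errno, or OTHER if none fits."""
    return SRB_TO_ERRNO.get(srb_code, OTHER)
```

Codes such as HPSS_NOCOS (-1499) have no obvious counterpart and land in `OTHER`, which is where a more expensive strategy has to take over.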
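Too Much Information: “Appeal to a Higher Power”
[Diagram: the App issues “open datafile” to the Virtual OS, which passes it to the SRB Driver and on to the SRB Server. The server replies HPSS_NOCOS, which falls into OTHER. The driver appeals to something outside the program:
– Throw an exception.
– Kill the process.
– Dump core.
or asks a Human: “Cannot assign a COS.” A)bort R)etry F)ail? EACCES, ENOENT, or EISDIR?]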
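The Problem of New Failure Modes
New Failure Modes: Login Errors
[Diagram: the App issues “open datafile” to the Virtual OS, which passes it to the GSI Driver, which sends “GET datafile” to the GSI Resource. Before the transfer, the login passes through several steps: Find Identity, Identify Certificate, Protocol Negotiation, Authentication, Authorization. Any step may fail, and the driver must decide what to return for “open datafile”: EPERM, EACCES, EPROTO...?]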
New Failure Modes: Login Errors
Hierarchy of error objects, much like Java.
Errors may be identified by individual type or their membership in a more general type.
class Error {
    Error trigger;
    Module place_in_code;
    Object thing_in_question;
    String message;
};
[Diagram: tree rooted at Error, with children Authentication, Authorization, and Communication, and leaf types NoCreds, ExpiredCreds, and NoTrust.]
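The hierarchy can be sketched with ordinary exception classes. The field names follow the `Error` class on the slide; placing NoCreds, ExpiredCreds, and NoTrust under the authentication branch is an assumption, as are the `handle` function and the reactions it returns.

```python
class Error(Exception):
    """Root of the hierarchy, carrying the fields named on the slide."""

    def __init__(self, message, trigger=None, place_in_code=None,
                 thing_in_question=None):
        super().__init__(message)
        self.message = message
        self.trigger = trigger                    # the error that caused this one
        self.place_in_code = place_in_code        # module that raised it
        self.thing_in_question = thing_in_question


class AuthenticationError(Error): pass
class AuthorizationError(Error): pass
class CommunicationError(Error): pass

# Assumed placement of the leaf types under the authentication branch.
class NoCreds(AuthenticationError): pass
class ExpiredCreds(AuthenticationError): pass
class NoTrust(AuthenticationError): pass


def handle(error):
    """React by individual type first, then by more general membership."""
    try:
        raise error
    except ExpiredCreds:
        return "renew credentials"
    except AuthenticationError:
        return "re-authenticate"
    except Error:
        return "give up"
```

A caller that knows about `ExpiredCreds` can react specifically, while one that only knows the general types still gets a sensible match.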
New Failure Modes: “Take it Easy”
Easy for program to interpret and react.
Difficult for a human to debug.
[Diagram: No identity, Couldn’t Authenticate, Not Authorized, and Protocol Not Supp. all map to EACCES.]
New Failure Modes: “Split Hairs”
Preserves unique error types for the savvy user.
Program may not be prepared to react to arbitrary error values.
[Diagram: No identity, Couldn’t Authenticate, Not Authorized, and Protocol Not Supp. each map to a different error among EPERM, EACCES, EPROTO, and ESRCH.]
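The two conversions, “Take it Easy” and “Split Hairs”, differ only in how many distinct codes the four login failures map to. In the sketch below the failure labels are paraphrased from the diagrams, and the particular pairing of failures with codes in `SPLIT` is a guess made for illustration; the slides do not make the pairing recoverable.

```python
import errno

FAILURES = (
    "no identity",
    "could not authenticate",
    "not authorized",
    "protocol not supported",
)

# Hypothetical pairing of each failure with one of the four codes shown.
SPLIT = {
    "no identity": errno.ESRCH,
    "could not authenticate": errno.EPERM,
    "not authorized": errno.EACCES,
    "protocol not supported": errno.EPROTO,
}


def take_it_easy(failure: str) -> int:
    """Collapse every login failure into one general code."""
    if failure not in FAILURES:
        raise ValueError(failure)
    return errno.EACCES


def split_hairs(failure: str) -> int:
    """Keep a distinct code per failure, at the risk of surprising callers."""
    return SPLIT[failure]
```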
New Failure Modes: Rocks Solution
“Reliable Sockets” by Vic Zandy
Give a general error code along the standard channel.
Give a detailed message along a back channel.
[Diagram: App above Reliable Sockets above Standard Sockets. Conditions shown: Connection Refused, Connection Lost, Reconnection Timeout Expired. Back channel: rserrno.]
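The two-channel idea can be sketched as follows. This is not the Rocks API: the `BackChannel` class, the `report` function, and the choice of EIO as the general code are all assumptions; only the name rserrno comes from the slide.

```python
import errno
import os
from typing import Optional


class BackChannel:
    """Side channel holding the latest detailed reason, in the spirit of rserrno."""

    def __init__(self):
        self.detail: Optional[str] = None


back_channel = BackChannel()


def report(detail: str, general_code: int = errno.EIO) -> OSError:
    """Record the detail on the back channel and build the general error."""
    back_channel.detail = detail
    # The standard channel carries only a code every program already expects.
    return OSError(general_code, os.strerror(general_code))
```

A program checks only `err.errno` and reacts as it always has; a person or a Rocks-aware tool can consult `back_channel.detail` for the specific cause.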
A Toolbox for Error Conversions
Simple Conversions:
– “Take it Easy”
– “Split Hairs”
– “Divide and Conquer”
“Grope in the Dark”
– “Never Forget a Face”
“Appeal to a Higher Power”
“Stick your Head in the Sand”
[Arrow alongside the list: Increasing Cost]
Error Accuracy can be A Performance Concern
We can always find some way to produce a correct -- even if undesired -- execution.
But:
– An “Appeal to a Higher Power” causes badput.
– “Groping in the Dark” yields high latencies.
– “Head in the Sand” may keep trying when no automatic recovery is possible.
– ...or, a failure to retry results in unnecessary user interaction.
Hints for Error Design
1 - Express errors in terms of the interface.
2 - Assume the audience is a program.
3 - Leave room to expand, but avoid using it.
4 - Give the essence, not the details.
1 - Express Errors in Terms of the Interface
Essence of separation of interface and implementation.
The user of an interface should not see a “moving target” as the implementation changes.
[Diagram: Application above File Interface, which is implemented by Disk Impl, Network Impl, Memory Impl, and “???”.]
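A small sketch of this hint: the implementation catches its own internal failure and reports it in the vocabulary of the file interface. The class names echo the diagram labels, but the bodies, the dictionary-backed storage, and the `errno_of` helper are assumptions for illustration.

```python
import errno


class FileInterface:
    """Interface contract: open() raises OSError with a file-level errno."""

    def open(self, name):
        raise NotImplementedError


class MemoryImpl(FileInterface):
    """Hypothetical in-memory implementation backed by a dictionary."""

    def __init__(self):
        self.files = {}

    def open(self, name):
        try:
            return self.files[name]
        except KeyError:
            # KeyError is an implementation detail. The caller should see
            # the same error a disk or network implementation would give.
            raise FileNotFoundError(
                errno.ENOENT, "No such file or directory", name) from None


def errno_of(call):
    """Run call() and return the errno it fails with, or 0 on success."""
    try:
        call()
    except OSError as exc:
        return exc.errno
    return 0
```

Swapping `MemoryImpl` for another implementation leaves the caller's error handling untouched, which is the point of the hint.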
2 - Assume the Audience is a Program
A computer-readable error can be used as the basis for a decision at any level.
A human-readable error can only result in a blind retry or an Appeal.
Computer-readable errors are easily made human-readable.
[Diagram: Layer 0, Layer 1, Layer 2, and a Human. An Error Code supports a Decision at every layer and by the Human. An Error Text gives the layers nothing to act on (“???”) and supports a Decision only by the Human.]
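The sketch below illustrates both halves of this hint: a layer decides from the code alone, and the same code is turned into text only at the point where a person reads it. The policy in `decide` and the function names are hypothetical.

```python
import errno
import os


def decide(code: int) -> str:
    """Hypothetical policy a layer might apply to an integer error."""
    if code in (errno.EAGAIN, errno.EINTR):
        return "retry"      # transient: the same call may succeed later
    if code == errno.ENOENT:
        return "fail"       # permanent: retrying cannot help
    return "escalate"       # not understood here: pass it upward


def explain(code: int) -> str:
    """Render the code for a human; the text depends on the platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"
```

Going the other way, from free text back to a code, would require parsing messages that vary by server, version, and language.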
3 - Leave Room to Expand ...but Avoid Using It
Any significantly different implementation of an interface will introduce new failure modes.
Possibilities for a new failure:
– Best case: fit it into an existing error.
– Medium case: return “unknown error.”
– Worst case: “Appeal to a Higher Power.”
4 - Give the Essence, not the Details
The details distract the caller from the nature of the problem and result in cascading “Appeals.”
Example in file systems:
– “Fell off the end of the directory linked list.”
– or “No file by that name.”
Example in networking:
– “Timer went off, but no network interrupt received.”
– or “Connection lost.”
Example in security:
– “Failure in PEM_do_header while reading password.”
– or “You have no credentials.”
A restatement of hint #1.
Hall of Fame
All authors remain anonymous.
– “Error in return value.”
– “A system call failed!”
– “Could not execute job. Reason: Success”
In Summary...
Error management is part of the “art” of software engineering.
The importance and the difficulty of error management are magnified in a virtual operating system.
All errors have some value, but low-signal errors result in performance problems.
Hints for error interface design.
Contact Info
Douglas Thain
– [email protected]
Software and other info:
– http://www.cs.wisc.edu/condor/pfs
Questions and discussion?