Systems Seminar Schedule

Page 1: Systems Seminar Schedule

Systems Seminar Schedule

1 October - Douglas Thain
– “Error Management in a Virtual Operating System”

15 October - Andrea Arpaci-Dusseau
– “Information and Control in Gray-Box Systems”

29 October - John Bent
– “Creating Communities for Grid I/O”

12 November - Open

26 November - Open

10 December - Open

Page 2: Systems Seminar Schedule

Error Management in a Virtual Operating System

Douglas Thain

Condor Project

University of Wisconsin

Page 3: Systems Seminar Schedule

What is a Virtual OS?

[Figure: layer diagram with the labels Hardware, Operating System, Device Drivers, Virtual OS 1, Virtual OS 2, and App 1 through App 4. Each Virtual OS is drawn with its own Device Drivers.]

Page 4: Systems Seminar Schedule

Why Use a Virtual OS?

To test and deploy software that would otherwise require destructive changes. (Wine, User Mode Linux)

To improve indirection or fault-tolerance. (Rocks, Socks, Grid Console)

To transparently harness exterior resources. (UFO, Condor, PFS)

Page 5: Systems Seminar Schedule

Harness the Grid

[Figure: Virtual OS 1 and Virtual OS 2 with App 1, App 2, App 3, and App 4.]

Page 6: Systems Seminar Schedule

In a Standard OS, Errors are not Difficult

Layers are members of a unified engineering effort.

A standard namespace and scheme are used end-to-end.

Most interfaces closely resemble the underlying implementation.

Most catastrophic failures are coordinated.

[Figure: a stack of layers labeled Device Driver, File System, OS Kernel, Standard Library, and App, with errno labels between the layers.]

Page 7: Systems Seminar Schedule

Handling Errors is a Serious Problem on the Grid

It is an important problem to solve:
– As systems grow more complex, MTBF -> 0.
– Failures are generally uncoordinated.
– Propagating knowledge of failure is more important than increasing likelihood of success.

It is a difficult problem to solve:
– Theoretical: Matching different abstractions.
– Technical: Mating different languages and conventions.
– Social: Coordinating distinct engineering efforts.

Page 8: Systems Seminar Schedule

Error Management: A Problem of Depth

[Figure: a deep chain of components labeled App, Virtual OS, FTP Driver, Globus FTP Library (twice), FTP Server, Unitree OS, Disk Cache, and Tape Archive, with further labels POSIX, Unitree, Globus, FTP, and DDI.]

Page 9: Systems Seminar Schedule

A Problem of Width

[Figure: an App above a Virtual Operating System, with an errno label, and a row of drivers: UNIX Driver, SRB Driver, FTP Driver, NeST Driver, Kangaroo Driver, Globus GASS Driver.]

An Alphabet Soup of Protocols, APIs, Systems, Authorities, and Authors

Page 10: Systems Seminar Schedule

A Problem of Design Direction

[Figure: two stacks. Bottom-Up Design: App, Application Library, Standard Library, OS Kernel, with errno labels. Outside-In Design: App, Virtual OS, FTP Driver, FTP Library, Globus, DDI, with an errno label and a “???” label.]

Page 11: Systems Seminar Schedule

How do we correctly represent errors in a virtual operating system?

Page 12: Systems Seminar Schedule

Spirit of this Talk

Software design involves striking balances -- there is no trivial answer.

Concentrate on presenting several concrete problems and working solutions.

Given these “data points,” I will present some reasonable generalizations.

Languages and conventions are ancillary issues.
– e.g. Exceptions vs. signals vs. errnos

Discussion and disagreement are welcome!

Page 13: Systems Seminar Schedule

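The Pluggable File System

[Figure: an App and Bypass above a set of drivers, each remote driver paired with its library: Local Driver, SRB Driver (SRB Library), Kangaroo Driver (Kangaroo Library), GridFTP Driver (GridFTP Library), NeST Driver (NeST Library), HTTP Driver (HTTP Library). Further labels: Grid Services and Host Operating System.]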

Page 14: Systems Seminar Schedule

Examples of PFS

% vi /gsiftp/vulture.cs.wisc.edu/etc/hosts

% grep phone /http/www.cs.wisc.edu/

% gcc /nest/turkey.cs.wisc.edu/input.c -o /kangaroo/khaki.ncsa.uiuc.edu/output

Page 15: Systems Seminar Schedule

The Pluggable File System: A Kernel on Top of a Kernel

[Figure: kernel-style tables inside PFS. A table of File Descriptors (numbered 0 through 12) refers to File Pointers, which refer to File Objects named by paths such as /srb/host, /tmp/data, /kangaroo/host, and /etc/hosts. Further labels: Current Working Directory, Mount Table, and namei. Below are the Local, SRB, Kangaroo, GridFTP, NeST, and HTTP drivers and the Host Operating System.]
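The Mount Table and namei labels suggest a path-to-driver lookup, which can be sketched roughly as follows. This is a minimal sketch and not the PFS implementation: the `MOUNT_TABLE` contents, the driver names as plain strings, and the `resolve` helper are all hypothetical. Only the path prefixes are taken from the Examples of PFS slide.

```python
# Hypothetical mount table: path prefix -> driver name (illustrative only).
MOUNT_TABLE = {
    "/gsiftp": "GridFTPDriver",
    "/http": "HTTPDriver",
    "/nest": "NeSTDriver",
    "/kangaroo": "KangarooDriver",
    "/srb": "SRBDriver",
}


def resolve(path):
    """Return (driver, host, remote_path) for an absolute path.

    Paths whose first component is not in the mount table fall through
    to the local driver unchanged.
    """
    parts = path.split("/")  # "/gsiftp/h/etc/hosts" -> ["", "gsiftp", "h", "etc", "hosts"]
    prefix = "/" + parts[1] if len(parts) > 1 else "/"
    driver = MOUNT_TABLE.get(prefix)
    if driver is None:
        return ("LocalDriver", None, path)
    host = parts[2] if len(parts) > 2 else ""
    remote_path = "/" + "/".join(parts[3:])
    return (driver, host, remote_path)


print(resolve("/gsiftp/vulture.cs.wisc.edu/etc/hosts"))
print(resolve("/etc/hosts"))
```

A real implementation would also have to handle relative paths against the current working directory, which this sketch ignores.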

Page 16: Systems Seminar Schedule

Not a Complete Virtual OS

Does not address process management, synchronization, etc...

Complete enough to be put to good use with real, non-trivial applications.
– Gaussian - atomic model simulation
– CMSIM - simulation of CERN LHC
– POVray - ray tracing software

Structure and concept are developed enough to explore other OS issues… others welcome!

Page 17: Systems Seminar Schedule

Top-Level Error Space

A single namespace of integer errors that apply to all levels of the system.

Any call is free to return any possible error. (124)

General vs specific:
– ENOENT vs ECHILD

Some artifacts:
– EACCES vs EPERM
– EADV and EDOTDOT

EPERM 1 /* Operation not permitted */
ENOENT 2 /* No such file or directory */
ESRCH 3 /* No such process */
EINTR 4 /* Interrupted system call */
EIO 5 /* I/O error */
ENXIO 6 /* No such device or address */
E2BIG 7 /* Arg list too long */
ENOEXEC 8 /* Exec format error */
EBADF 9 /* Bad file number */
ECHILD 10 /* No child processes */
EAGAIN 11 /* Try again */
ENOMEM 12 /* Out of memory */
EACCES 13 /* Permission denied */
..
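The flat integer namespace listed above can be inspected from Python's standard `errno` module. This is only an illustration of the namespace; the numeric values are defined by the platform, and the ones in the comments are the Linux values shown on the slide.

```python
import errno
import os

# A few entries from the single integer error namespace.
for code in (errno.EPERM, errno.ENOENT, errno.EACCES):
    name = errno.errorcode[code]   # integer -> symbolic name, e.g. 2 -> "ENOENT"
    text = os.strerror(code)       # integer -> platform-specific message
    print(code, name, text)
```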

Page 18: Systems Seminar Schedule

Concrete Problems and Solutions

Too little information - file transfer replies (FTP)
– Stick your head in the sand.
– Grope in the dark.
– Never forget a face.

Too much information - infinite namespace (SRB)
– Divide and conquer.
– Appeal to a higher power.

New failure modes - login errors (Globus)
– Take it easy.
– Split hairs.

Page 19: Systems Seminar Schedule

The Problem of Too Little Information

Page 20: Systems Seminar Schedule

Too Little Information: FTP Replies

Integer codes indicate the severity of a response to an action.

Many transfer problems are identified, but few file system problems are.

Third digit specified infrequently, and for wide classes of errors.

First digit:
100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative

Second digit:
000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System

550: “e.g. File not found, no access”
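The two-digit structure of a reply code can be decoded mechanically, which makes the shortage of information concrete. This is a minimal sketch; the table contents come from the slide, while the `decode_reply` helper and its "Unknown" fallback are hypothetical.

```python
# First digit: severity of the reply.
SEVERITY = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

# Second digit: broad category of the reply.
CATEGORY = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def decode_reply(code):
    """Split a three-digit FTP reply code into (severity, category)."""
    first = code // 100
    second = (code // 10) % 10
    return (SEVERITY.get(first, "Unknown"), CATEGORY.get(second, "Unknown"))


print(decode_reply(550))
```

For 550 the decoder can only say that something in the file system failed permanently. It cannot tell a missing file from a permission problem, which is the gap the following slides try to close.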

Page 21: Systems Seminar Schedule

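Too Little Information: FTP Replies

[Figure: the App calls “open datafile” on the Virtual OS, which passes “open datafile” to the FTP Driver. The driver sends “GET datafile” to the FTP Server, which replies “550: Pas de tellementlime ou repertoire...”. The open question is which error to return: ENOENT, EACCES, EISDIR...?]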

Page 22: Systems Seminar Schedule

Too little Information: “Stick your head in the sand”

If you don’t understand the failure, keep trying until the result is acceptable.

Might work for transient errors.

Might even work for the savvy user that can identify and fix problems.
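A blind retry loop is about all this strategy amounts to. The sketch below assumes failures surface as `OSError` and, unlike the slide's "keep trying until the result is acceptable", bounds the number of attempts so the example terminates; `keep_trying` and its parameters are hypothetical names.

```python
import time


def keep_trying(operation, attempts=5, delay=0.0):
    """Call `operation` until it succeeds or `attempts` (>= 1) are used up.

    The error is never examined; the last one is re-raised on exhaustion.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as error:
            last_error = error
            time.sleep(delay)  # wait, then try again without understanding why
    raise last_error
```

If the failure is not transient, every attempt is wasted, which is the cost noted later under performance concerns.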

Page 23: Systems Seminar Schedule

Too little Information: “Grope in the Dark”

if GET succeeds
    return success
else
    if CHDIR succeeds
        return EISDIR
    else
        if LIST succeeds
            return EACCES
        else
            return ENOENT
        end
    end
end

[Figure: the probes GET, CHDIR, and LIST tried in sequence, with the result EACCES.]
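The probing sequence above translates directly into code. This is a sketch under assumptions: the client object with boolean `get`, `chdir`, and `list` methods is a hypothetical interface, and `StubClient` stands in for a real FTP connection so the example runs on its own.

```python
import errno


class StubClient:
    """Stand-in for an FTP connection; each probe reports success as a bool."""

    def __init__(self, readable=(), directories=(), listed=()):
        self.readable = set(readable)
        self.directories = set(directories)
        self.listed = set(listed)

    def get(self, path):
        return path in self.readable

    def chdir(self, path):
        return path in self.directories

    def list(self, path):
        return path in self.listed


def grope(client, path):
    """Return 0 on success, otherwise an errno inferred by extra probes."""
    if client.get(path):
        return 0
    if client.chdir(path):
        return errno.EISDIR   # it is a directory, so GET could not fetch it
    if client.list(path):
        return errno.EACCES   # it is visible, so the failure was access
    return errno.ENOENT       # nothing can see it


client = StubClient(readable={"a"}, directories={"d"}, listed={"secret"})
print(grope(client, "missing") == errno.ENOENT)
```

Each failed open costs up to two additional round trips, and the server's state may change between probes.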

Page 24: Systems Seminar Schedule

Too little Information: “Never Forget a Face”

Each error condition has a signature:
– Server identifier: “wuftpd 4.1 ftp.cs”
– Operation attempted: “GET”
– Message in reply: “550: Pas de tallenmand...”

First “Grope” and then cache the determined error along with the signature.

Problems:
– Server must be consistent
– Groping is not atomic
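Caching by signature can be sketched as a dictionary keyed on the three signature fields. The `ErrorCache` class and the shape of the `probe` callable are hypothetical; the sketch only shows that the expensive grope runs once per distinct signature.

```python
import errno


class ErrorCache:
    """Remember which errno a (server, operation, message) signature meant."""

    def __init__(self, probe):
        self._probe = probe   # expensive fallback: (operation, path) -> errno
        self._known = {}      # signature -> errno

    def classify(self, server_id, operation, message, path):
        signature = (server_id, operation, message)
        if signature not in self._known:
            # First sighting: grope once and remember the answer.
            self._known[signature] = self._probe(operation, path)
        return self._known[signature]


probes = []


def fake_probe(operation, path):
    probes.append(path)       # record how often the expensive path runs
    return errno.ENOENT


cache = ErrorCache(fake_probe)
cache.classify("wuftpd 4.1 ftp.cs", "GET", "550: ...", "datafile")
cache.classify("wuftpd 4.1 ftp.cs", "GET", "550: ...", "otherfile")
print(len(probes))
```

The cache is only as good as the first listed problem allows: if the server reuses one message for several conditions, the remembered answer is wrong for some of them.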

Page 25: Systems Seminar Schedule

The Problem of Too Much Information

Page 26: Systems Seminar Schedule

Too Much Info: SRB Replies

Multiplexes many server backends into one client interface.

Error space is an amalgam of all back end error spaces.

Any call may return any error.

1026 and growing!

UNIX_EPERM -1301
UNIX_ENOENT -1302
. . .
UNIX_EDEADLOCK -1356

HPSS_EPERM -1401
HPSS_ENOENT -1402
. . .
HPSS_NOCOS -1499

MCAT_OPEN_ERROR -3001
MCAT_CONNECT_ERROR -3002
. . .
MCAT_USER_NOT_IN_DOMN -3032

SQL_RSLT_TOO_LONG -1600
HTTP_ERR_BAD_PATH -1700

Page 27: Systems Seminar Schedule

Too Much Information: “Divide and Conquer”

[Figure: the SRB error codes (the UNIX_*, HPSS_*, MCAT_*, SQL_*, and HTTP_* ranges, -1301 through -3032) are mapped onto a small set of results: EPERM, ENOENT, ESRCH, EINTR, EIO, EACCES, EISDIR, and OTHER.]
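The mapping can be sketched as a lookup table with a catch-all. The SRB code numbers are the ones shown on the slides; the pairing of each code with an errno value, the `OTHER` sentinel value, and the `divide_and_conquer` helper are assumptions made for illustration.

```python
import errno

OTHER = -1  # hypothetical sentinel for codes with no sensible errno

# Assumed pairings: only the obviously matching codes are mapped.
SRB_TO_ERRNO = {
    -1301: errno.EPERM,    # UNIX_EPERM
    -1302: errno.ENOENT,   # UNIX_ENOENT
    -1401: errno.EPERM,    # HPSS_EPERM
    -1402: errno.ENOENT,   # HPSS_ENOENT
}


def divide_and_conquer(srb_code):
    """Map a backend-specific SRB code to the top-level error space."""
    return SRB_TO_ERRNO.get(srb_code, OTHER)


print(divide_and_conquer(-1302) == errno.ENOENT)
print(divide_and_conquer(-1499) == OTHER)   # HPSS_NOCOS has no obvious match
```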

Page 28: Systems Seminar Schedule

“Appeal to a Higher Power”

[Figure: the App calls “open datafile”, which passes through the Virtual OS and the SRB Driver to the SRB Server. The server returns HPSS_NOCOS. The choice among EACCES, ENOENT, or EISDIR is unclear, so the error is classified as OTHER. The options shown are: throw an exception, kill the process, dump core, or ask the Human: “Cannot assign a COS.” A)bort R)etry F)ail?]

Page 29: Systems Seminar Schedule

The Problem of New Failure Modes

Page 30: Systems Seminar Schedule

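New Failure Modes: Login Errors

[Figure: the App calls “open datafile” on the Virtual OS, which passes it to the GSI Driver; the driver sends “GET datafile” to the GSI Resource. The login steps shown are Find Identity, Protocol Negotiation, Authentication, and Authorization, with a further label “Identify Certificate”. The open question is which error to return: EPERM, EACCES, EPROTO...?]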

Page 31: Systems Seminar Schedule

New Failure Modes: Login Errors

Hierarchy of error objects, much like Java.

Errors may be identified by individual type or their membership in a more general type.

class Error {
    Error trigger;
    Module place_in_code;
    Object thing_in_question;
    String message;
};

[Figure: a tree rooted at Error with the branches Authentication, Authorization, and Communication, and the leaves NoCreds, ExpiredCreds, and NoTrust.]
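Such a hierarchy can be sketched with ordinary exception classes. The field names follow the class on the slide; placing NoCreds, ExpiredCreds, and NoTrust under the authentication branch is an assumption, since the extracted figure does not show which parent each leaf has.

```python
class Error(Exception):
    """Root error object carrying the fields named on the slide."""

    def __init__(self, message, trigger=None, place_in_code=None,
                 thing_in_question=None):
        super().__init__(message)
        self.message = message
        self.trigger = trigger                  # the error that caused this one
        self.place_in_code = place_in_code      # module that raised it
        self.thing_in_question = thing_in_question


class AuthenticationError(Error):
    pass


class AuthorizationError(Error):
    pass


class CommunicationError(Error):
    pass


# Assumed placement of the leaves under Authentication.
class NoCreds(AuthenticationError):
    pass


class ExpiredCreds(AuthenticationError):
    pass


class NoTrust(AuthenticationError):
    pass


try:
    raise ExpiredCreds("credentials expired", place_in_code="GSIDriver")
except AuthenticationError as error:   # caught by the more general type
    print(type(error).__name__, error.message)
```

A caller that only cares about the general class catches `AuthenticationError`, while one that can renew credentials catches `ExpiredCreds` specifically.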

Page 32: Systems Seminar Schedule

New Failure Modes: “Take it Easy”

Easy for program to interpret and react.

Difficult for a human to debug.

[Figure: the conditions No Identity, Couldn’t Authenticate, Not Authorized, and Protocol Not Supported all map to EACCES.]

Page 33: Systems Seminar Schedule

New Failure Modes: “Split Hairs”

Preserves unique error types for the savvy user.

Program may not be prepared to react to arbitrary error values.

[Figure: the conditions No Identity, Couldn’t Authenticate, Not Authorized, and Protocol Not Supported map to separate codes: EPERM, EACCES, EPROTO, and ESRCH.]
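The two conversions differ only in the lookup table. In this sketch the condition names come from the slides, but the specific pairing of each condition with a code in `SPLIT_HAIRS` is an assumption, since the extracted figure does not show which arrow goes where.

```python
import errno

CONDITIONS = (
    "no identity",
    "could not authenticate",
    "not authorized",
    "protocol not supported",
)

# "Take it Easy": every login failure collapses to one familiar code.
TAKE_IT_EASY = {condition: errno.EACCES for condition in CONDITIONS}

# "Split Hairs": each failure keeps a distinct code (pairing assumed).
SPLIT_HAIRS = {
    "no identity": errno.ESRCH,
    "could not authenticate": errno.EPERM,
    "not authorized": errno.EACCES,
    "protocol not supported": errno.EPROTO,
}

for condition in CONDITIONS:
    print(condition, TAKE_IT_EASY[condition], SPLIT_HAIRS[condition])
```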

Page 34: Systems Seminar Schedule

New Failure Modes: Rocks Solution

“Reliable Sockets” by Vic Zandy

Give a general error code along the standard channel.

Give a detailed message along a back channel.

[Figure: an App above Reliable Sockets above Standard Sockets. Labels on the standard channel: Connection Refused, Connection Lost. Label on the back channel: rserrno, carrying Reconnection Timeout Expired.]
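The two-channel idea can be sketched as a general return code plus a side variable for the detail. This is not the Rocks API: only the name rserrno appears on the slide, and the `back_channel` dictionary, the `report_failure` helper, and the choice of `ECONNRESET` as the general code are assumptions.

```python
import errno

# Back channel: holds the most recent detailed reason (semantics assumed).
back_channel = {"rserrno": None}


def report_failure(detail):
    """Record the detail on the back channel; return a general code."""
    back_channel["rserrno"] = detail
    return errno.ECONNRESET   # general "connection lost" on the standard channel


code = report_failure("Reconnection Timeout Expired")
print(code == errno.ECONNRESET, back_channel["rserrno"])
```

An unmodified program reacts to the general code as it would with standard sockets, while a caller aware of the library can consult the back channel for the specific cause.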

Page 35: Systems Seminar Schedule

A Toolbox for Error Conversions

Simple Conversions:
– “Take it Easy”
– “Split Hairs”
– “Divide and Conquer”

“Grope in the Dark”
– “Never Forget a Face”

“Appeal to a Higher Power”

“Stick your Head in the Sand”

[Figure: an arrow labeled “Increasing Cost” alongside the list.]

Page 36: Systems Seminar Schedule

Error Accuracy can be a Performance Concern

We can always find some way to produce a correct -- even if undesired -- execution.

But -
– An “Appeal to a Higher Power” causes badput.
– “Groping in the Dark” yields high latencies.
– “Head in the Sand” may keep trying when no automatic recovery is possible.
– ...or, a failure to retry results in unnecessary user interaction.

Page 37: Systems Seminar Schedule

Hints for Error Design

1 - Express errors in terms of the interface.
2 - Assume the audience is a program.
3 - Leave room to expand, but avoid using it.
4 - Give the essence, not the details.

Page 38: Systems Seminar Schedule

1 - Express Errors in Terms of the Interface

Essence of separation of interface and implementation.

The user of an interface should not see a “moving target” as the implementation changes.

[Figure: an Application uses a File Interface, which is implemented by Disk Impl, Network Impl, and Memory Impl; a “???” label sits next to Memory Impl.]
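One way to read this hint is that each implementation translates its own failures into errors the interface defines. The sketch below assumes a tiny file interface with a single `read` method; `NoSuchFile`, `DiskImpl`, and `MemoryImpl` are hypothetical names that echo the figure labels.

```python
class FileInterfaceError(Exception):
    """Base class for every error the file interface may raise."""


class NoSuchFile(FileInterfaceError):
    pass


class DiskImpl:
    def read(self, name):
        try:
            with open(name, "rb") as handle:
                return handle.read()
        except FileNotFoundError:
            # Translate the disk-specific failure into the interface's terms.
            raise NoSuchFile(name) from None


class MemoryImpl:
    def __init__(self):
        self._blobs = {}

    def write(self, name, data):
        self._blobs[name] = data

    def read(self, name):
        try:
            return self._blobs[name]
        except KeyError:
            # A KeyError would expose the dictionary behind the interface.
            raise NoSuchFile(name) from None


for implementation in (DiskImpl(), MemoryImpl()):
    try:
        implementation.read("no-such-file-for-this-example.bin")
    except NoSuchFile as error:
        print(type(implementation).__name__, "->", type(error).__name__)
```

The caller sees the same error whichever implementation is plugged in, so swapping implementations does not change the error handling it needs.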

Page 39: Systems Seminar Schedule

2 - Assume the Audience is a Program

A computer-readable error can be used as the basis for a decision at any level.

A human-readable error can only result in a blind retry or an Appeal.

Computer-readable errors are easily made human-readable.

[Figure: Layer 0, Layer 1, Layer 2, and a Human. An Error Text passes the layers as “???” and reaches the Human, where a Decision is made. An Error Code allows a Decision at each layer.]
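The asymmetry between codes and text can be shown in a few lines. In this sketch the `decide` policy and the `for_human` formatter are hypothetical; the point is that a layer can branch on a code, and the text is derived only when a person needs it.

```python
import errno
import os


def decide(code):
    """An intermediate layer choosing an action from the error code alone."""
    if code == errno.EAGAIN:
        return "retry"
    if code == errno.ENOENT:
        return "give up"
    return "pass up"


def for_human(code):
    """Render a code as text at the last moment, for the human reader."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


print(decide(errno.EAGAIN))
print(for_human(errno.ENOENT))
```

Going the other way, from free text back to a decision, would require parsing messages whose wording varies between platforms and servers.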

Page 40: Systems Seminar Schedule

3 - Leave Room to Expand ...but Avoid Using It

Any significantly different implementation of an interface will introduce new failure modes.

Possibilities for a new failure:
– Best case: fit it into an existing error.
– Medium case: return “unknown error.”
– Worst case: “Appeal to a Higher Power.”

Page 41: Systems Seminar Schedule

4 - Give the Essence, not the Details

The details distract the caller from the nature of the problem and result in cascading “Appeals.”

Example in file systems:
– “Fell off the end of the directory linked list.”
– or “No file by that name.”

Example in networking:
– “Timer went off, but no network interrupt received.”
– or “Connection lost.”

Example in security:
– “Failure in PEM_do_header while reading password.”
– or “You have no credentials.”

A restatement of hint #1.

Page 42: Systems Seminar Schedule

Hall of Fame

All authors remain anonymous.
– “Error in return value.”
– “A system call failed!”
– “Could not execute job. Reason: Success”

Page 43: Systems Seminar Schedule

In Summary...

Error management is part of the “art” of software engineering.

The importance and the difficulty of error management are magnified in a virtual operating system.

All errors have some value, but low-signal errors result in performance problems.

Hints for error interface design.

Page 44: Systems Seminar Schedule

Contact Info

Douglas Thain
– [email protected]

Software and other info:
– http://www.cs.wisc.edu/condor/pfs

Questions and discussion?