
Errors, Status, and Asynchrony Discussion Session PPDG Data Replication Meeting 10 January 2002 Douglas Thain, Condor Project University of Wisconsin



Agenda

A Working Model
Two Error-Management Issues
– Thinking of Data-Movement as “Jobs”
– Reconciling Error Representations
Example Problem
Discussion
Open Issues:
– Hints and Absolutes in Replica Management
– Tradeoff between consistency and availability

Discussion Points

Data Job Management and Fault-Tolerance
– What faults do we intend to tolerate/expose/ignore?
– Can we develop a general transaction infrastructure for replication-related activities?
– How should we evaluate designs that may be error sensitive? (design review, stress testing)

Error Identification and Representation
– Should we have a uniform error space?
– Is it feasible to translate between existing error spaces?
– What systems have unusual error modes that outsiders may not expect?
– How do we deal with unusual errors that must pass through existing APIs?

A Working Model: Giggle

[Figure: a GRIN and Replica Sites A and B. One table maps logical names to physical names (L1 -> P1, L2 -> P2, L3 -> P3); another maps logical names to sites (L1 -> B, L2 -> B, L3 -> B).]

Foster, Iamnitchi, Ripeanu, Chervenak, Deelman, Kesselman, Hoschek, Kunszt, Stockinger, Stockinger, Tierney, “Giggle: A Framework for Constructing Scalable Replica Location Services”

The Problem

Replication systems will be subject to a wide variety of errors.

How do we build systems that maintain consistency in the face of errors?
– Answer: Use transactions to manage jobs, but...

How do we build systems that make reasonable performance decisions in the face of errors?
– Answer: Informative errors, but…

Fault Tolerance Terminology

Failure
– An externally-visible deviation from specifications.

Error
– An internal data state that leads to a failure.

Fault
– An external event that creates an error.

A. Avizienis and J.C. Laprie, Dependable computing: From concepts to design diversity, Proc IEEE 74, 5 (May) 629-638

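Example

[Figure: a Client asks a Server a question; the labels FAULT, ERROR, and FAILURE mark stages of the exchange.]
Client: What is sqrt(4)?
Server: Hmm, sqrt(4) is...
Server: Hmm, sqrt(9) is...
Server: Answer: 3

Read against the terminology: the fault is whatever turns the 4 into a 9, the error is the resulting internal state sqrt(9), and the failure is the visibly wrong answer 3.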

Silent errors (failures)
– The system claims to have reached a valid result, but an auditor claims it is invalid.

Explicit errors (failures)
– The system tells us it cannot complete the desired action.

Escaping errors (failures)
– The system detects an error, but has no method of reporting it, so it escapes by an alternate route -- drop connection, core dump, kernel panic. (exception)

John B. Goodenough, Exception Handling: Issues and a Proposed Notation, CACM 18:12 (1975), pp 683-696.

What Errors to Expect in a Replication System?

Errors of communication:
– File transfer was broken between bytes.
– Collection transfer was broken between files.

Errors of omission:
– Requested some files, but response was slow, so the caller gave up and left. (with/out abort?)

Errors in configuration:
– Space at target server can’t admit all incoming data at once.

What Must Be Consistent?

[Figure: Replica Catalog and Replica Sites A and B, showing the logical-to-physical table (L1 -> P1, L2 -> P2, L3 -> P3), the logical-to-site table (L1 -> B, L2 -> B, L3 -> B), and the physical files P1, P2, P3.]

Index of files and the files themselves must be kept consistent.

Giggle does not require that a GRIN be up-to-date, but it is useful to consider.

Data Movement as a Job

Each request issued for replication must have a past, present, and future:
– Who issued it, and why?
– What is it doing now?
– Is it done? Did it succeed?
– Enough information to roll back after a failure.

A complete program execution:
– data jobs + cpu jobs + dependencies = DAGMan/DaPMan

Job Management

Primary technique for reliably interacting with the job queue: transaction.

ACID Test: Atomicity, Consistency, Isolation, Durability.

Of course, the natural interface to a db, but not all participants are a full db.
– Interface: 2PL and friends
– Implementation: Logging, shadowing, a real db?

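Two-Phase Commit

[Figure: message exchange between Client and Server. The server's Stable Storage holds a Work Space and an Archival Space.]
Client -> Server: prepare(data)
Server -> Client: tid (id or failure)
Client -> Server: commit(tid)
Server -> Client: ok

J. Eliot Moss, Nested Transactions: An Approach to Reliable Distributed Computing, MIT Press, 1985.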
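The prepare/commit exchange can be sketched in a few lines of Python. This is a minimal illustration, not the protocol of any particular system: the class name `ReplicaServer`, the `abort` method, and the use of in-memory dictionaries in place of stable storage are all assumptions made for the example.

```python
import itertools


class ReplicaServer:
    """Toy server: prepared data waits in a work space until committed.

    Plain dicts stand in for stable storage here; a real server would
    have to persist both spaces so that they survive a reboot.
    """

    def __init__(self):
        self._next_tid = itertools.count(1)
        self.work_space = {}      # tid -> data, prepared but not committed
        self.archival_space = {}  # tid -> data, committed

    def prepare(self, data):
        """Phase one: stage the data and hand back a transaction id."""
        tid = next(self._next_tid)
        self.work_space[tid] = data
        return tid

    def commit(self, tid):
        """Phase two: move staged data into the archival space."""
        if tid in self.archival_space:
            return "ok"           # a repeated commit is harmless
        if tid not in self.work_space:
            return "failure"      # never prepared, or already aborted
        self.archival_space[tid] = self.work_space.pop(tid)
        return "ok"

    def abort(self, tid):
        """Hypothetical extra call: discard staged data."""
        self.work_space.pop(tid, None)


server = ReplicaServer()
tid = server.prepare(b"file contents")
result = server.commit(tid)
```

Making `commit` safe to repeat matters because a client that loses the "ok" reply cannot tell whether the commit happened and has to ask again.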

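Two-Phase Commit

[Figure: message exchange between Client and Server, labelled with a PREPARE phase and a COMMIT phase. The server's Stable Storage holds a Work Space and an Archival Space.]
PREPARE
Client -> Server: begin()
Server -> Client: tid
Client -> Server: add(tid,data)
Server -> Client: ok
Client -> Server: end(tid)
Server -> Client: ok
COMMIT
Client -> Server: commit(tid)
Server -> Client: ok

James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), 2001.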

Transactions and Status

The transaction ID then becomes a persistent “job number” for later queries:
– Success, failure, abort, timeout…
– unknown-past, unknown-future.

For this status to be useful, a record of the job must be kept around for a certain period of time.

Also ok to time out, cancel, or otherwise remove data movement jobs.

But, a committed transaction must be kept. Can’t re-use a job number!
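One way to picture the status rules is a job table that only ever counts upward. The sketch below is illustrative: the names `JobTable` and `Status` are invented, "committed" is equated with `SUCCESS`, and the reading of unknown-past (a number that was issued but whose record is gone) and unknown-future (a number not yet issued) is an assumption about what the slide intends.

```python
from enum import Enum


class Status(Enum):
    WORKING = "working"
    SUCCESS = "success"
    FAILURE = "failure"
    ABORT = "abort"
    TIMEOUT = "timeout"
    UNKNOWN_PAST = "unknown-past"      # issued once, record since discarded
    UNKNOWN_FUTURE = "unknown-future"  # number has not been issued yet


class JobTable:
    def __init__(self):
        self.next_tid = 1
        self.records = {}  # tid -> Status

    def begin(self):
        tid = self.next_tid
        self.next_tid += 1  # numbers only go up, so none is ever re-used
        self.records[tid] = Status.WORKING
        return tid

    def finish(self, tid, status):
        self.records[tid] = status

    def discard(self, tid):
        """Drop a record, refusing to forget a committed transaction."""
        if self.records.get(tid) is Status.SUCCESS:
            raise ValueError("committed transactions must be kept")
        self.records.pop(tid, None)

    def status(self, tid):
        if tid in self.records:
            return self.records[tid]
        if tid < self.next_tid:
            return Status.UNKNOWN_PAST
        return Status.UNKNOWN_FUTURE
```

Because `next_tid` never decreases, a query for a discarded job can still be told apart from a query for a job that has not been created.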

Transaction Implementations

Logging
– Keep a log of all actions, new and old values.
– Read forward to redo, backwards to undo.

Shadowing
– Add changed data to unallocated space.
– Atomically commit new pointers to data.

[Figure: data blocks (D) referenced from a metadata block (M); changed data is written to new blocks and installed with an atomic pointer update.]
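The logging approach can be illustrated with a toy key-value store. This is a sketch under simplifying assumptions: the function names are invented, the log lives in a Python list rather than on disk, and `None` is reserved to mean "the key did not exist before".

```python
def apply_update(store, log, key, new_value):
    """Record the old and new values in the log, then change the store."""
    log.append((key, store.get(key), new_value))
    store[key] = new_value


def redo(store, log):
    """Read the log forward, reinstalling each new value."""
    for key, _old, new in log:
        store[key] = new


def undo(store, log):
    """Read the log backward, restoring each old value."""
    for key, old, _new in reversed(log):
        if old is None:
            store.pop(key, None)  # key was created by this update
        else:
            store[key] = old


store = {"L1": "P1"}
log = []
apply_update(store, log, "L1", "P7")
apply_update(store, log, "L2", "P2")
```

Calling `undo(store, log)` at this point returns the store to `{"L1": "P1"}`, and a following `redo(store, log)` brings back both updates.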

Transaction Implementations

If a standard file system is the underlying storage, then shadowing is a natural fit.
– Most metadata updates are designed to be atomic and synchronous.
– Most large data updates are designed to provide good throughput, but are asynchronous and not guaranteed until after an explicit commit.

Atomic File Update

    fd = creat("file.tmp")
    write(fd,data,length)
    fsync(fd)
    close(fd)
    rename("file.tmp","file")

On Success: Done.
On Failure or abort: unlink("file.tmp")
On reboot: unlink("*.tmp")

(Technique used on Condor checkpoint servers and scheduler processes.)
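The same write, sync, and rename sequence can be written as a small Python helper. This is a sketch, not the Condor code: the function names and the `.tmp` suffix convention are assumptions, and it does not sync the containing directory, which some file systems need before the rename itself is durable.

```python
import glob
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so readers see the old or the new file, never a mix."""
    tmp = path + ".tmp"
    try:
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before renaming
        os.replace(tmp, path)     # atomic rename over any existing file
    except BaseException:
        # On failure or abort: remove the partial temporary file.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def cleanup_after_reboot(directory):
    """On reboot: remove every leftover *.tmp file."""
    for leftover in glob.glob(os.path.join(directory, "*.tmp")):
        os.unlink(leftover)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "file")
    atomic_write(target, b"checkpoint v1")
    atomic_write(target, b"checkpoint v2")
    with open(target, "rb") as f:
        final_contents = f.read()
    # Simulate a crash that left a partial temporary file behind.
    with open(os.path.join(workdir, "stale.tmp"), "wb") as f:
        f.write(b"partial")
    cleanup_after_reboot(workdir)
    remaining = sorted(os.listdir(workdir))
```

Note that a shell-style pattern such as `*.tmp` has to be expanded by the caller (here with `glob`); the underlying unlink call takes a single file name.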

Unifying Storage Services

[Figure: an App calls POSIX on a Virtual Operating System, which sits on a set of drivers: UNIX Driver, SRB Driver, GridFTP Driver, NeST Driver, Kangaroo Driver, GASS Driver.]

An Alphabet Soup of Protocols, APIs, Systems, Authorities, and Authors

Error Representation: A Problem of Depth

[Figure: a deep stack of components and the interfaces between them. Components: App, Bypass Agent, Replica Access Library, Replica Server (two), Replica Catalog, Disk Cache, Tape Archive, FTP Server. Interface labels: POSIX, PPDG API, RMP (twice), RAP, FTP, Win32, ???.]

A Problem of Design Direction

[Figure: two layer stacks.
Bottom Up Design: App, Application Library, Standard Library, OS Kernel; interface labels POSIX, ANSI, ???.
Outside In Design: App, Virtual OS, Replica Access, Replica Server, SRB; interface labels PPDG API, POSIX.]

The End-to-End Argument

In complex software, the outermost layer has the ultimate responsibility for interpreting and recovering from errors.

Recovery in a lower layer is an optimization of performance or convenience.

If the possibility of error is very high, lower-level recovery is needed for good performance.

Saltzer, Reed, and Clark, End-to-End Arguments in System Design, ACM Transactions on Computer Systems 2:4, pp 277-288, 1984.

UNIX Errnos

A single namespace of integer errors that apply to all levels of the system.

Any call is free to return any possible error. (124)

General vs specific:
– ENOENT vs ECHILD

Some artifacts:
– EACCES vs EPERM
– EADV and EDOTDOT

EPERM    1  /* Operation not permitted */
ENOENT   2  /* No such file or directory */
ESRCH    3  /* No such process */
EINTR    4  /* Interrupted system call */
EIO      5  /* I/O error */
ENXIO    6  /* No such device or address */
E2BIG    7  /* Arg list too long */
ENOEXEC  8  /* Exec format error */
EBADF    9  /* Bad file number */
ECHILD  10  /* No child processes */
EAGAIN  11  /* Try again */
ENOMEM  12  /* Out of memory */
EACCES  13  /* Permission denied */
...

FTP Reply Codes

Integer codes indicate the severity of a response to an action.

Many transfer problems are identified, but few file system problems are.

Third digit specified infrequently, and for wide classes of errors.

First digit:
100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative

Second digit:
000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System

550: “e.g. File not found, no access”
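A caller can act on the first two digits without knowing the exact code. The following sketch decodes a reply code using only the two tables above; the function names and the retry rule (retry only on a transient negative reply) are illustrative assumptions, not part of any FTP library.

```python
FIRST_DIGIT = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

SECOND_DIGIT = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def classify_reply(code):
    """Split a three-digit reply code into (severity, category)."""
    first = code // 100
    second = (code // 10) % 10
    return (FIRST_DIGIT.get(first, "Unknown"),
            SECOND_DIGIT.get(second, "Unknown"))


def should_retry(code):
    """Assumed policy: only transient negative replies are worth retrying."""
    return code // 100 == 4
```

For example, `classify_reply(550)` yields `("Permanent Negative", "File System")`, which says the request failed for a file system reason but not whether the file is missing or merely unreadable.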

SRB Reply Codes

Error space is an amalgam of all back end error spaces.

Pros: No information is ever lost in translation.

Cons: Very difficult to write code that switches on the error number (1026 cases).

UNIX_EPERM             -1301
UNIX_ENOENT            -1302
. . .
UNIX_EDEADLOCK         -1356

HPSS_EPERM             -1401
HPSS_ENOENT            -1402
. . .
HPSS_NOCOS             -1499

MCAT_OPEN_ERROR        -3001
MCAT_CONNECT_ERROR     -3002
. . .
MCAT_USER_NOT_IN_DOMN  -3032

SQL_RSLT_TOO_LONG      -1600

HTTP_ERR_BAD_PATH      -1700

Globus Error Objects

Pros:
– Errors may be identified at varying levels of granularity.
– Easily expandable.
– Lots of debug info.

Cons:
– Can be difficult to decide in which class to place an external error.
– In practice, most errors are returned as objects of type “string”.

[Figure: error class hierarchy. Error has subclasses Authentication, Authorization, and Communication; further subclasses shown are NoCreds, ExpiredCreds, and NoTrust. A String type is also shown.]

Translation Can be Done… to a Point

[Figure: the SRB error codes on one side are mapped onto a small set of UNIX errnos on the other, with a catch-all OTHER.]

UNIX_EPERM             -1301
UNIX_ENOENT            -1302
. . .
UNIX_EDEADLOCK         -1356

HPSS_EPERM             -1401
HPSS_ENOENT            -1402
. . .
HPSS_NOCOS             -1499

MCAT_OPEN_ERROR        -3001
MCAT_CONNECT_ERROR     -3002
. . .
MCAT_USER_NOT_IN_DOMN  -3032

SQL_RSLT_TOO_LONG      -1600

HTTP_ERR_BAD_PATH      -1700

EPERM
ENOENT
ESRCH
EINTR
EIO
EACCES
EISDIR
OTHER
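A table-driven translation shows where the mapping runs out. In this sketch the numeric codes are the ones listed on the slide, but pairing each with an errno purely by its name is an assumption, and `OTHER` is a hypothetical sentinel for everything that has no counterpart.

```python
import errno

OTHER = -1  # hypothetical catch-all for codes with no errno equivalent

# Assumed pairing, made by name only.
SRB_TO_ERRNO = {
    -1301: errno.EPERM,   # UNIX_EPERM
    -1302: errno.ENOENT,  # UNIX_ENOENT
    -1401: errno.EPERM,   # HPSS_EPERM
    -1402: errno.ENOENT,  # HPSS_ENOENT
}


def translate(srb_code):
    """Map an SRB-style code to an errno, or OTHER when none fits."""
    return SRB_TO_ERRNO.get(srb_code, OTHER)
```

A code such as HPSS_NOCOS (-1499) falls through to `OTHER`: the caller learns that something failed, but the detail that made the error distinctive is lost.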

Grope in the Dark

    if GET succeeds
        return success
    else
        if CHDIR succeeds
            return EISDIR
        else
            if LIST succeeds
                return EACCES
            else
                return ENOENT
            end
        end
    end

[Figure: the probe sequence GET, CHDIR, LIST, ending in EACCES.]
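The probing sequence translates directly into code. The sketch below takes the three operations as caller-supplied functions that return True on success; those functions, and the tiny fake server used to exercise them, are hypothetical stand-ins rather than calls from a real FTP client.

```python
import errno


def probe_get_failure(get, chdir, list_dir, path):
    """Return 0 if GET works, otherwise guess an errno by further probing."""
    if get(path):
        return 0
    if chdir(path):
        return errno.EISDIR   # we can enter it, so it is a directory
    if list_dir(path):
        return errno.EACCES   # it is visible, but we may not read it
    return errno.ENOENT       # nothing by that name


# A tiny fake server, for illustration only.
readable = {"/a"}
unreadable = {"/secret"}
directories = {"/dir"}


def fake_get(path):
    return path in readable


def fake_chdir(path):
    return path in directories


def fake_list(path):
    return path in readable or path in unreadable or path in directories
```

Each failed GET can cost two more round trips, and the answer is still a guess: the server state may change between probes.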

Error Identification is a Performance Concern

We can always find some way to produce an execution that avoids a silent failure.
– Pass all errors up one level.
– Retry all errors until time expires.
– Abort process completely.

But, a known, finite error space allows the caller to make targeted decisions about what to do next:
– “Not Authorized” -- best to pass up one level.
– “Operation Interrupted” -- best to retry here.
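With a finite error space, the retry-or-pass-up decision becomes a simple lookup. This sketch uses Python's `OSError` and errno values as the error space; the function name, the choice of `EINTR` and `EAGAIN` as retryable, and the attempt limit are assumptions for illustration.

```python
import errno

# Assumed policy: these are worth retrying at this level.
RETRY_HERE = {errno.EINTR, errno.EAGAIN}


def call_with_policy(operation, max_attempts=3):
    """Retry interrupted operations here; pass every other error up."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError as exc:
            last_attempt = attempt == max_attempts - 1
            if exc.errno in RETRY_HERE and not last_attempt:
                continue  # "Operation Interrupted": best to retry here
            raise         # e.g. "Not Authorized": best to pass up one level
```

An operation that fails twice with `EINTR` and then succeeds returns normally on the third attempt, while one that fails with `EACCES` is raised to the caller after a single try.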

Give the Essence or Give the Details?

Example in file systems:
– “Fell off the end of the directory linked list.”
– or “No file by that name.”

Example in networking:
– “Timer went off, but no network interrupt received.”
– or “Connection lost.”

Example in security:
– “Failure in PEM_do_header while reading password.”
– or “You have no credentials.”

Example in storage:
– HPSS_NOCOS
– or ?????

Example and Discussion

Example

Goal:
– User requests a replication of a file from B to A.

Data Structures at each Node:
– A persistent map from LFNs to PFNs.
– A persistent store for transactions.
– A persistent store for data.

Assumptions:
– Files are read-only, no need for invalidation.
– All nodes must survive reboot cleanly.
– File transfers may be resumed from any point.

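[Figure: Client, Replica Catalog, Replica Site A, and Replica Site B, with the logical-to-physical table (L1 -> P1, L2 -> P2, L3 -> P3), the logical-to-site table (L1 -> B, L2 -> B, L3 -> B), and the messages:]
– I want LFN 2
– Where is LFN 2?
– At site B.
– Get LFN 2
– Got it.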

[Figure: transaction state at Replica Site A during the transfer. Client and Server exchange prepare(get L2), T53, commit(T53), ok. The server holds a transaction record T53.tmp (LFN = L2, PFN = P16, State = Working), which becomes T53 and later reaches State = Done; an LFN -> TRN table with the entry L2 -> T53; and the physical data file P16.]

More Issues

Cleanup at Reboot:
– Remove uncommitted transactions.
– Jobs in progress: Update LFN->TRN entry.

Client Status Check:
– Requesting client examines state of transaction.
– Or, other clients indirect through LFN entry.

Notification of Status Change:
– Unreliable -- Server sends messages to client.
– Reliable -- Server must do transaction to client.

(See Condor-G Paper)
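The reboot cleanup can be sketched over in-memory tables. Everything here is an assumption made for illustration: transaction records are held in a dict keyed by name, a name ending in `.tmp` marks an uncommitted record (following the T53.tmp naming in the example), and "update the LFN->TRN entry" is read as dropping entries whose transaction no longer exists.

```python
def cleanup_at_reboot(transactions, lfn_to_trn):
    """Remove uncommitted transactions, then fix up the LFN -> TRN table.

    transactions: record name -> record dict; '.tmp' names are uncommitted.
    lfn_to_trn:   logical file name -> transaction name.
    """
    for name in [n for n in transactions if n.endswith(".tmp")]:
        del transactions[name]
    # Assumed meaning of "update": drop entries left pointing at nothing.
    dangling = [lfn for lfn, trn in lfn_to_trn.items()
                if trn not in transactions]
    for lfn in dangling:
        del lfn_to_trn[lfn]


transactions = {
    "T53": {"LFN": "L2", "PFN": "P16", "State": "Done"},
    "T54.tmp": {"LFN": "L3", "PFN": "P17", "State": "Working"},
}
lfn_to_trn = {"L2": "T53", "L3": "T54"}
cleanup_at_reboot(transactions, lfn_to_trn)
```

After the call only T53 and the entry L2 -> T53 remain. A server that can resume transfers from any point might instead keep the job and restart it, which is a different reading of the same bullet.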