The Ethernet Approach to Grid Computing Douglas Thain and Miron Livny Condor Project, University of Wisconsin http://www.cs.wisc.edu/condor/ftsh


The Ethernet Approach to Grid Computing

Douglas Thain and Miron Livny

Condor Project, University of Wisconsin

http://www.cs.wisc.edu/condor/ftsh

The UW US-CMS Physics Grid

[Figure: software layers of the grid, labeled with their implementation languages: MCRunJob (python), Impala (bash), MOP (python), Submit DAG (perl), DAGMan (C++), Condor-G (C++), Gridmanager (C++), GAHP Server (C++), Gatekeeper (C), Jobmanager (C), Batch Interface (bash), Batch System (???), MOP wrapper (bash), Impala wrapper (bash), Wrapper, Actual Job (Fortran), globus-url-copy (C).]

try for 30 minutes
    ...
end

Outline

• Two problems in real systems:
  – Timing is uncontrollable.
  – Failures lack detail.

• A solution:
  – The Ethernet Approach.

• A language and a tool:
  – The Fault Tolerant Shell.
  – Time and failures are explicit.

• Example Applications:
  – Shared Job Queue.
  – Shared Disk Buffer.
  – Shared Data Servers.

[Figure: the Ethernet rules (Carrier Sense, Collision Detect, Exponential Backoff, Limited Allocation) and the shared data server scenario: clients, two WWW servers each holding the dataset, and a black hole.]

1 - Timing is Uncontrollable

• Consider a distributed file system.

• Suppose that the network is down.
  – “soft mounted”: failure after one minute
  – “hard mounted”: failure never exposed

• Time is an unknown in nearly every operating system activity:
  – Process invocation.
  – Memory access.
  – Network communications.

2 - Failures Lack Detail

• Consider this trivial program:

    % cp a b

• We would like to distinguish:
  – “success.”
  – “file not found.”
  – “nfs server down, still trying.”
  – “couldn’t find library libc.so.25.”

• Actual results:
  – “success.” (exit code 0)
  – “file not found.” (exit code 1)
  – “nfs server down, still trying.” (exit code 1)
  – “couldn’t find library libc.so.25.” (exit code 1)
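The collapse of distinct failures into one exit code can be reproduced with a small Python sketch. The three one-line child programs below are hypothetical stand-ins (not from the talk) for a missing file, a missing library, and an unreachable server; each dies with a different uncaught exception, yet the parent sees only an integer.

```python
import subprocess
import sys

# Three unrelated ways for a child process to fail. Each one-liner dies with
# a different uncaught Python exception (hypothetical stand-ins for the
# failure causes listed above).
FAILING_PROGRAMS = {
    "file not found": "open('/no-such-directory-for-this-demo/a')",
    "library not found": "import no_such_module_for_this_demo",
    "server down": "raise TimeoutError('server down, still trying')",
}


def exit_codes():
    """Run each failing program and return {cause: exit code}."""
    codes = {}
    for cause, program in FAILING_PROGRAMS.items():
        # The detailed message goes to stderr; the caller's only
        # machine-readable result is the integer exit code.
        result = subprocess.run(
            [sys.executable, "-c", program], capture_output=True, text=True
        )
        codes[cause] = result.returncode
    return codes


if __name__ == "__main__":
    for cause, code in exit_codes().items():
        print(f"{cause}: exit code {code}")
```

A caller that branches only on the exit code cannot tell these cases apart, which is why the rest of the talk treats every failure as the same untyped event.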

Examples Abound!

• TCP connect -> ECONNREFUSED
  – Wrong port number.
  – A loaded service is rejecting connections.
  – The machine has just rebooted, has initialized TCP/IP, but not yet started the service.

• FTP RETR -> code 550
  – “550 File or directory not found.”
  – “550 Erlaubnis hat verweigert.”
  – “550 Archiveer systeem offline.”
  – “550 Fuori di memoria.”
  – “550 File staging in from tape.” (NCSA Unitree)

Not enough information or control.

Real systems have these problems. How can we learn to live with them?
“Ethernet Approach” (HPDC 2003)

How do we design new systems that avoid these problems?
“Error Scope” (HPDC 2002)

The Ethernet Approach

Network, or Memory, or Disk Space, or OS Resources

Ethernet Rules:
  – Carrier Sense
  – Collision Detect
  – Exponential Backoff
  – Limited Allocation

No Carrier Sense == Aloha Protocol

The Fault Tolerant Shell

• A tool that encourages the Ethernet approach in system integration.
  – Similar to the Bourne or C-Shells.
  – Process invocation and repetition are simple.
  – Other elements are possible but ugly.

• Not meant to be general purpose, high performance, or abstractly beautiful.
  – Not OOP, AOP, SOP, GP, etc...
  – Ethernet ideas could be used in such languages.

• Elements:
  – Brittle property, try/catch, timed try, forany/forall.

The Brittle Property

    wget http://host/file.tar.gz
    gunzip file.tar.gz
    tar xvf file.tar

Failure of any step causes an immediate halt of the entire group.

Untyped Exceptions

    try
        wget http://host/file.tar.gz
        gunzip file.tar.gz
        tar xvf file.tar
    catch
        echo "Zoiks!"
    end

Failure of this group raises an exception.

Exceptions have no type!
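As a rough Python analogue of a brittle group with an untyped exception: the `Failure` class and `run_group` helper below are illustrative names chosen for this sketch, not part of ftsh. The first command that exits non-zero halts the group, and the only thing the caller learns is that the group failed.

```python
import subprocess


class Failure(Exception):
    """The single, untyped failure signal; it deliberately carries no cause."""


def run_group(commands):
    """Run each argv list in order; halt the whole group at the first failure."""
    for argv in commands:
        if subprocess.run(argv).returncode != 0:
            # Remaining commands are never started.
            raise Failure()


if __name__ == "__main__":
    try:
        run_group([
            ["wget", "http://host/file.tar.gz"],
            ["gunzip", "file.tar.gz"],
            ["tar", "xvf", "file.tar"],
        ])
    except (Failure, OSError):
        # OSError covers the case where a command is not installed at all.
        print("Zoiks!")
```

Because the handler cannot inspect a cause, the only sensible reactions are the ones the talk goes on to describe: clean up, wait, and try again.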

Timed Try Statements

    try for 30 minutes
        wget http://host/file.tar.gz
        gunzip file.tar.gz
        tar xvf file.tar
    end

An exception in the enclosed statements triggers a retry, for up to 30 minutes (with exponential backoff).

The enclosed statements will be cancelled after 30 minutes.

Success after n attempts is as good as success after one. (Otherwise, failure.)

Timed Try Statements

• If group completes within time limit.
  – Try block succeeds.

• If group fails within time limit.
  – Automatically retried.
  – Exponentially increasing delay.
  – Random factor to avoid collisions.

• If group runs over time limit.
  – Resources reclaimed, exception thrown.
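The retry rules above can be sketched in Python. This is a minimal illustration, not ftsh's implementation: the names `try_for` and `Failure`, the one-second starting delay, the doubling factor, and the 1x to 2x random factor are all assumptions. Unlike ftsh, the sketch checks the deadline only between attempts and does not cancel an attempt that is already running.

```python
import random
import time


class Failure(Exception):
    """Untyped failure raised by an attempt or by an expired time limit."""


def try_for(seconds, action, base_delay=1.0, max_delay=3600.0,
            sleep=time.sleep, clock=time.monotonic):
    """Retry action() until it succeeds or the time limit is used up."""
    deadline = clock() + seconds
    delay = base_delay
    while True:
        try:
            return action()  # success after n attempts is still success
        except Failure:
            pass
        # Random factor (assumed 1x to 2x) so colliding clients spread out.
        wait = delay * (1.0 + random.random())
        if clock() + wait >= deadline:
            raise Failure()  # limited allocation: give up at the time limit
        sleep(wait)
        delay = min(delay * 2.0, max_delay)  # exponentially increasing delay
```

The `sleep` and `clock` parameters exist only so the loop can be exercised with a simulated clock; a caller would normally write `try_for(30 * 60, fetch_and_unpack)`, where `fetch_and_unpack` is whatever function raises `Failure` on error.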

forany and forall

    forany host in xxx yyy zzz
        wget http://${host}/file
    end

Attempt to make this statement succeed for any random branch.

    forall host in xxx yyy zzz
        wget http://${host}/file
    end

Attempt to make this statement succeed for all branches simultaneously.
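One way to picture the two constructs in Python is shown below. The function names mirror the ftsh keywords, but the implementation is only a sketch under stated assumptions: `forany` tries branches one at a time in random order, and `forall` runs branches in threads and waits for every one to finish, rather than cancelling the remaining branches once one has failed.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class Failure(Exception):
    """Untyped failure of a single branch or of the whole statement."""


def forany(items, action):
    """Succeed as soon as action(item) succeeds for any one item."""
    candidates = list(items)
    random.shuffle(candidates)  # random branch order spreads load
    for item in candidates:
        try:
            return action(item)
        except Failure:
            continue  # try the next branch
    raise Failure()  # every branch failed


def forall(items, action):
    """Run action(item) for all items concurrently; all must succeed."""
    items = list(items)
    results = []
    failed = False
    with ThreadPoolExecutor(max_workers=max(1, len(items))) as pool:
        futures = [pool.submit(action, item) for item in items]
        for future in futures:
            try:
                results.append(future.result())
            except Failure:
                failed = True
    if failed:
        raise Failure()
    return results
```

With a hypothetical `fetch(host)` that raises `Failure` on error, `forany(["xxx", "yyy", "zzz"], fetch)` returns the first successful result, while `forall` returns a list of results in the order of the hosts given.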

Example Applications

Ethernet Property     Job Queue          Disk Buffer            Data Servers         Handled by
Collision Detect      failed cmd         failed cmd             failed cmd           ftsh
Exp. Backoff          “try” backoff      “try” backoff          “try” backoff        ftsh
Limited Allocation    “try” timeout      “try” timeout          “try” timeout        ftsh
Carrier Sense         File Descriptors   Estimated Free Space   Short Active Probe   coder

Shared Job Queue

[Figure: three clients, a Condor schedd holding jobs, its Job Queue and Activity Log on the local filesystem, a matchmaker, and three CPUs.]

Multiple clients connect to a job queue to manipulate jobs (submit, query, remove, etc.). What’s the bottleneck?

Aloha Client

    try for 5 minutes
        condor_submit job.file
    end

Ethernet Client

    try for 5 minutes
        if avail_fds() .lt. 1000
            failure
        end
        condor_submit job.file
    end

Measure free file descriptors. If there are too few, throw an exception and try again.
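The carrier-sense step of the Ethernet client can be sketched in Python as follows. This assumes a Linux host, where `/proc/sys/fs/file-nr` holds three numbers (allocated handles, allocated-but-unused handles, and the system maximum); the helper names `available_fds` and `sense_then_submit` are hypothetical, and `submit` stands for whatever actually runs `condor_submit`. A surrounding retry loop with backoff is assumed but not shown.

```python
class Failure(Exception):
    """Untyped failure: tells the enclosing retry loop to back off."""


def available_fds(file_nr_text):
    """Estimate free system-wide file handles.

    file_nr_text is the content of Linux's /proc/sys/fs/file-nr:
    "allocated  unused  maximum".
    """
    allocated, _unused, maximum = (int(field) for field in file_nr_text.split())
    return maximum - allocated


def sense_then_submit(file_nr_text, submit, threshold=1000):
    """Carrier sense: only attempt the submission if the resource looks free."""
    if available_fds(file_nr_text) < threshold:
        raise Failure()  # defer without adding load to the busy system
    return submit()


if __name__ == "__main__":
    with open("/proc/sys/fs/file-nr") as f:  # Linux only
        print("estimated free file handles:", available_fds(f.read()))
```

The point of the check is that a deferred client costs the shared system nothing, whereas a blind submission consumes descriptors on a machine that is already short of them.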

Shared Disk Buffer

[Figure: Jobs 8, 9, and 10 share a buffer on the local file system holding files d4.c, d5.c, d6.c, d7.c, d8.i, d9.i, and d10.i; a data mover drains it. Step A: Arbitrate. Step B: Write. Step C: Commit. Step D: Read. Step E: Send. Step F: Delete.]

Multiple batch jobs share an output buffer. Jobs write output files, and a mover pushes them out.

Aloha Client

    try for 30 minutes
        try
            run-job > d$n.i
            mv d$n.i d$n.c
        catch
            rm -f d$n.i
        end
    end

  – run-job > d$n.i: Create the file, marked “incomplete.”
  – mv d$n.i d$n.c: Atomically commit the file.
  – rm -f d$n.i: Remove the file if any failure.

Ethernet Client

    try for 30 minutes
        if overcommitted()
            failure
        end
        try
            run-job > d$n.i
            mv d$n.i d$n.c
        catch
            rm -f d$n.i
        end
    end

Buffer is overcommitted if estimated needs exceed available space.
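The write, commit, and clean-up steps, together with the overcommit check, might look like this in Python. It is a sketch under assumptions: `write_output`, `overcommitted`, and the `produce` callback are illustrative names, and the check here compares only this job's estimated need against free space, whereas a real check would presumably also account for other jobs' pending output.

```python
import os
import shutil


class Failure(Exception):
    """Untyped failure: the caller should back off and retry later."""


def overcommitted(buffer_dir, estimated_need):
    """True if the estimated need (bytes) exceeds the buffer's free space."""
    return estimated_need > shutil.disk_usage(buffer_dir).free


def write_output(buffer_dir, n, produce, estimated_need):
    """Write job n's output as d<n>.i, then atomically commit it as d<n>.c."""
    if overcommitted(buffer_dir, estimated_need):
        raise Failure()  # carrier sense: do not even start writing
    incomplete = os.path.join(buffer_dir, f"d{n}.i")
    complete = os.path.join(buffer_dir, f"d{n}.c")
    try:
        with open(incomplete, "wb") as out:
            produce(out)  # create the file, marked "incomplete"
        os.replace(incomplete, complete)  # atomic rename on one filesystem
    except Exception:
        # Remove the partial file on any failure so it never reaches the mover.
        if os.path.exists(incomplete):
            os.remove(incomplete)
        raise Failure() from None
    return complete
```

The data mover only ever picks up `.c` files, so a crash or full disk in the middle of a write leaves nothing behind that could be mistaken for finished output.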

Shared Data Servers

[Figure: five clients, two WWW servers each holding a copy of the dataset, and a black hole.]

  – Black hole: accepts all connections and holds them idle indefinitely.
  – A healthy but loaded server might also have a high response time.

Each client wants one instance of the data set, but doesn’t care which one. How to deal with delays and failures?

Aloha Client

    try for 15 minutes
        forany host in xxx yyy zzz
            try for 1 minute
                wget http://${host}/data
            end
        end
    end

Ethernet Client

    try for 15 minutes
        forany host in xxx yyy zzz
            try for 5 seconds
                wget http://${host}/tiny
            end
            try for 1 minute
                wget http://${host}/data
            end
        end
    end

Test the server by fetching a tiny file.
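A Python sketch of the probe-then-fetch pattern follows. The `/tiny` and `/data` paths come from the slide; `http_get`, `fetch_from_any`, and the injectable `get` parameter are assumptions made for illustration. Note that the `timeout` passed to `urllib.request.urlopen` bounds each blocking socket operation rather than the whole transfer, so it is a looser limit than ftsh's timed try, and the outer 15-minute retry loop is omitted.

```python
import http.client
import random
import urllib.request


class Failure(Exception):
    """Untyped failure of a probe or a transfer."""


def http_get(url, timeout):
    """Fetch url, mapping network and HTTP errors to Failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, http.client.HTTPException) as error:
        # URLError, HTTPError, and socket timeouts are OSError subclasses.
        raise Failure() from error


def fetch_from_any(hosts, get=http_get, probe_timeout=5, data_timeout=60):
    """Try hosts in random order; probe each with a tiny fetch first."""
    candidates = list(hosts)
    random.shuffle(candidates)
    for host in candidates:
        try:
            # Carrier sense: a short active probe exposes black holes cheaply.
            get(f"http://{host}/tiny", probe_timeout)
            return get(f"http://{host}/data", data_timeout)
        except Failure:
            continue  # move on to the next server
    raise Failure()
```

With the probe in place, a black hole costs each client about five seconds per visit instead of a full minute, which is the difference between the Aloha and Ethernet clients above.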

All clients blocked on black hole.

Some Thoughts

• This is a necessary technique for real problems.
  – Timing is uncontrollable; failures lack detail.
  – A simple technique has significant payoff.

• The Ethernet approach is not always ideal.
  – Carefully chosen errnos are powerful.
  – Designing errnos is tricky.

• Requires clients of good will.
  – Some scenarios require external coordination.
  – Admission control for admission control?

• Time and failure are first-class concerns.
  – They should be first-class elements of languages!
  – We get good mileage without complex constructions.

• More info at:
  – http://www.cs.wisc.edu/condor/ftsh

Computing’s central challenge, “How not to make a mess of it,” has not yet been met.

- Edsger Dijkstra