The Condor Data Access Framework GridFTP / NeST Day 31 July 2001 Douglas Thain

The Condor Data Access Framework
  • Philosophy
  • Components
  • Organization: Communities
  • Resource Discovery with ClassAds
  • Example Applications
  • Ongoing Work

Philosophy
  • Goal: location-independent execution of jobs with large I/O needs.
  • Build moderately-sized mechanisms that can be quickly deployed to existing problems.
  • With experience, explore general-purpose policies and larger systems.
  • Priorities:
      Reliability and Correctness
      Throughput (PB/year)
      …
      Performance (MB/sec)
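To relate the two units in the priority list, here is a small arithmetic sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) and a 365-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 s, ignoring leap years


def pb_per_year_to_mb_per_sec(pb_per_year: float) -> float:
    """Convert a sustained rate from PB/year to MB/sec (decimal units)."""
    bytes_per_year = pb_per_year * 10**15
    return bytes_per_year / SECONDS_PER_YEAR / 10**6


rate = pb_per_year_to_mb_per_sec(1.0)
print(f"1 PB/year is about {rate:.1f} MB/sec, sustained all year")
```

The conversion only holds if the rate is sustained without interruption for the whole year.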

Where does Globus fit in?
  • We expect that the Globus protocols will be the lingua franca of the grid.
  • Condor is committed to speaking the right language in order to participate.
  • Like any integration effort, there are some impedance-matching problems in both protocols and APIs.
  • None are insurmountable.

Components
  • NeST - Network Storage Appliance
  • ReqEx - Scheduled Data Mover
  • Kangaroo - Opportunistic Data Mover
  • Bypass - Adapts Apps to Grid
  • ClassAds - Express Relationships
  • Others?

Components (overview diagram)
  • NeST [diagram labels: MSS, NeST, FTPD]
  • ReqEx: Schedules I/O according to declarations.
  • Kangaroo: Performs I/O as apps request and conditions permit.
  • Bypass: Adapts ordinary I/O operations into grid protocols.
  • ClassAds: Express relationships and restrictions between participants.

ReqEx: Scheduled Data Mover
[Diagram labels: FTPD, NeST]
  • Begin with list of jobs and data needs.
  • Reserve space, move inputs, submit jobs, move outputs.
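The four phases on this slide can be sketched as a planning function. This is a minimal illustration, not ReqEx itself: the `Job` fields, the size units, and the step names are all hypothetical, and the sketch assumes that an input shared by several jobs is staged only once.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    inputs: dict      # hypothetical: input file name -> size in MB
    output_mb: int    # hypothetical: declared output size in MB


def plan(jobs, capacity_mb):
    """Turn declared jobs and data needs into an ordered list of steps."""
    inputs = {}
    for job in jobs:
        inputs.update(job.inputs)  # assumption: shared inputs are staged once
    needed = sum(inputs.values()) + sum(job.output_mb for job in jobs)
    if needed > capacity_mb:
        raise ValueError(f"need {needed} MB, only {capacity_mb} MB available")
    steps = [("reserve", f"{needed} MB")]
    steps += [("move_input", name) for name in sorted(inputs)]
    steps += [("submit", job.name) for job in jobs]
    steps += [("move_output", job.name) for job in jobs]
    return steps


jobs = [
    Job("sim1", {"calib.dat": 100}, 50),
    Job("sim2", {"calib.dat": 100, "geom.dat": 20}, 50),
]
for step in plan(jobs, capacity_mb=500):
    print(step)
```

Because everything is declared up front, the space check can fail before any data moves.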

Kangaroo: Opportunistic Data Mover
[Diagram labels: FTPD, NeST, NeST]
  • Move outputs back: during execution, as conditions permit, fine-grained, hop-by-hop.
  • Move inputs: on demand. Should cache.
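The hop-by-hop behavior for outputs can be illustrated with a toy simulation. It is a sketch under stated assumptions, not Kangaroo's implementation: the `HopChain` class is hypothetical, each hop is modeled as an in-memory FIFO spool, and a block advances over a link only when that link is reported as up.

```python
from collections import deque


class HopChain:
    """Toy model: spools[0] is at the execution site, spools[-1] is the destination."""

    def __init__(self, n_hops):
        self.spools = [deque() for _ in range(n_hops + 1)]

    def write(self, block):
        # The application's write always succeeds: it only touches the local spool.
        self.spools[0].append(block)

    def step(self, links_up):
        # links_up[i] says whether the link from spool i to spool i+1 is usable.
        # Walking from the far end means a block advances at most one hop per step.
        for i in reversed(range(len(self.spools) - 1)):
            if links_up[i] and self.spools[i]:
                self.spools[i + 1].append(self.spools[i].popleft())

    def delivered(self):
        return list(self.spools[-1])


chain = HopChain(n_hops=2)
chain.write("a")
chain.write("b")
chain.step([True, False])   # second link is down: "a" waits at the middle hop
chain.step([False, True])   # first link is down: "a" still reaches the destination
print(chain.delivered())
```

A failed link delays delivery but never blocks the writer, which is the property the slide describes as moving data "as conditions permit".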

Bypass
[Diagram labels: NeST, Bypass]
  • Creates interposition agents that re-route system calls to other code.
  • Pluggable File System (PFS): An agent built with Bypass. Presents grid protocols as filesystems.
  • Example: vi /ftp/coral.cs.wisc.edu/etc/hosts
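The example path suggests a naming scheme in which the first path component selects the protocol and the second the host. The sketch below splits a name along those lines. The `/protocol/host/path` layout is an assumption drawn from this single example, and `split_pfs_path` and `GridName` are hypothetical names, not part of PFS.

```python
from typing import NamedTuple


class GridName(NamedTuple):
    protocol: str
    host: str
    path: str


def split_pfs_path(name: str) -> GridName:
    """Split '/protocol/host/rest' into its parts (assumed layout)."""
    parts = name.split("/", 3)
    if len(parts) < 3 or parts[0] != "" or not parts[1] or not parts[2]:
        raise ValueError(f"not of the form /protocol/host/path: {name!r}")
    rest = parts[3] if len(parts) > 3 else ""
    return GridName(protocol=parts[1], host=parts[2], path="/" + rest)


print(split_pfs_path("/ftp/coral.cs.wisc.edu/etc/hosts"))
```

An interposition agent could apply a mapping of this kind inside its replacement for `open`, then hand the remainder of the path to the matching protocol driver.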

Organizing Structure: I/O Communities
  • A community is simply a storage appliance shared by a number of CPUs.
  • Traditional community: distributed file system.
  • Ordinary users want to restructure communities according to application and load.
  • So, communities for grid computing should be easy to set up, reconfigure, and tear down.
  • NeST + Bypass makes this easy: use the protocol appropriate for the situation.

I/O Communities
[Diagram labels: Short-haul I/O; Long-haul I/O (GridFTP); Chirp]

What Discovery System?
  • Device Discovery: Where is my disk? Where can I place my output now?
  • Replica Discovery: If X is not on my disk, where can I find it? If I fetch X, where should I put it so that others can find it?

Everything Together
[Diagram labels: Agent, Job, Device Discovery, Replica Discovery, CPU Discovery, Execution Site, NeST, Remote Storage, Short-Haul, Long-Haul]

Resource Discovery with ClassAds
  • "Classic" ClassAds describe the properties and requirements of two parties looking for each other.
  • When expressing I/O communities, there are three parties to a match: jobs, machines, and storage.
  • By extending the language slightly, we allow jobs to refer to the properties of the attached storage:
      Requirements = NearestStorage.HasCMSData

Classic ClassAds
[Diagram: Machine, Machine, Job; a Job Ad and a Machine Ad are joined by a match]

References in ClassAds
[Diagram: Machine, Machine, NeST, Job; a Job Ad, a Machine Ad, and a Storage Ad are joined by a match]
  • Job Ad: refers to NearestStorage.
  • Machine Ad: knows where NearestStorage is.

ClassAd Example

Job Ad:
    Type = "Job"
    Cmd = "cmsim.exe"
    Owner = "thain"
    Requirements = (OpSys==LINUX) && (NearestStorage.HasCMS)

Machine Ad:
    Type = "Machine"
    Name = "vulture"
    OpSys = "Linux"
    Requirements = (Owner=="thain")
    NearestStorage = (Type=="Storage") && (Name=="turkey")

Storage Ad:
    Type = "Storage"
    Name = "turkey"
    HasCMS = True
    CMSPath = "/cms"
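To make the three-party match concrete, the ads above can be mimicked with plain Python dictionaries. This is only a sketch of the idea, not the ClassAd language or the Condor matchmaker: expressions are written as Python lambdas, the helper names are hypothetical, and the operating system comparison is assumed to ignore case.

```python
job_ad = {
    "Type": "Job", "Cmd": "cmsim.exe", "Owner": "thain",
    # The job's requirement refers to the storage attached to the machine.
    "Requirements": lambda machine, storage:
        machine["OpSys"].lower() == "linux" and storage.get("HasCMS", False),
}

machine_ad = {
    "Type": "Machine", "Name": "vulture", "OpSys": "Linux",
    "Requirements": lambda job: job["Owner"] == "thain",
    # The machine knows how to identify its nearest storage.
    "NearestStorage": lambda ad: ad["Type"] == "Storage" and ad["Name"] == "turkey",
}

storage_ads = [
    {"Type": "Storage", "Name": "pigeon", "HasCMS": False},  # hypothetical extra ad
    {"Type": "Storage", "Name": "turkey", "HasCMS": True, "CMSPath": "/cms"},
]


def resolve_nearest_storage(machine, storage_ads):
    """Return the first storage ad the machine's NearestStorage expression accepts."""
    for ad in storage_ads:
        if machine["NearestStorage"](ad):
            return ad
    return None


def matches(job, machine, storage_ads):
    storage = resolve_nearest_storage(machine, storage_ads)
    if storage is None:
        return False
    return bool(machine["Requirements"](job) and job["Requirements"](machine, storage))


print(matches(job_ad, machine_ad, storage_ads))
```

The job never names "turkey" itself. It only states what the nearest storage must offer, and the machine's ad supplies the binding.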

Notes on ClassAds
  • Every match is a hint.
      Participants must verify in claiming phase.
      Storage: If dataset is missing, abort process and roll back.
  • Reference feature is new - Condor 6.3
      A variation on 'gang-matching' as described by Raman et al.
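The "every match is a hint" rule can be sketched as a claim step that re-checks the advertised dataset and undoes earlier actions if it is missing. The action names and the `claim` function are hypothetical; the slides do not describe the actual claiming protocol.

```python
def claim(advertised_path, files_on_storage):
    """Return (ok, log). The match is only a hint, so it is re-checked here."""
    log = []
    undo_stack = []
    for action, undo in [("claim machine", "release machine"),
                         ("claim storage", "release storage")]:
        log.append(action)
        undo_stack.append(undo)
    if advertised_path not in files_on_storage:
        # The ad was stale: abort and roll back in reverse order.
        while undo_stack:
            log.append(undo_stack.pop())
        return False, log
    log.append("start job")
    return True, log


print(claim("/cms", {"/cms"}))
print(claim("/cms", set()))
```

Rolling back in reverse order releases the storage claim before the machine claim, so no participant is left holding a resource for a job that will not run.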

Example Applications
  • I/O Communities: Applied to CMS simulation codes running at INFN and UW. Unmodified apps retrieve calibration data from nearest NeST.
  • Kangaroo: Applied to Gaussian codes running at NCSA. Users get progressive output when possible, but network failures don't stop output. Same idea applied to CMS reconstruction at INFN. (Older work called Grid Console.)
  • ReqEx: In testing mode on CMS reconstruction at UW.

Ongoing Work
  • Move jobs to data or vice versa? We can easily build communities for a particular application. Can we build software that works reasonably well in any situation?
  • Select staging or remote I/O? Depends on number of jobs, storage capacity, network capacity, etc.
  • Integration with replica management. Is the App->NeST channel collection aware?
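One way to see why the staging versus remote I/O choice depends on job count and storage capacity is a toy byte-count comparison. The model below is purely an illustrative assumption, not a policy from the talk: staging copies the whole file once to shared storage, remote I/O transfers only the fraction each job reads, and both use the same long-haul link, so network capacity cancels out of the comparison.

```python
def choose(file_mb, n_jobs, fraction_read, free_storage_mb):
    """Pick the option that moves fewer bytes over the long-haul link (toy model)."""
    if file_mb > free_storage_mb:
        return "remote I/O"           # the file cannot be staged at all
    staged_mb = file_mb               # whole file, copied once
    remote_mb = n_jobs * fraction_read * file_mb
    return "staging" if staged_mb <= remote_mb else "remote I/O"


print(choose(file_mb=1000, n_jobs=10, fraction_read=0.05, free_storage_mb=5000))
print(choose(file_mb=1000, n_jobs=100, fraction_read=0.05, free_storage_mb=5000))
```

Under these assumptions, a few jobs that each touch a small part of the file favor remote I/O, while many jobs sharing the same file favor staging.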

Upcoming Publications
  • Thain, Basney, Chang, Livny, “The Kangaroo Approach to Data Movement on the Grid”, HPDC 10.
  • Thain, Bent, Livny, Arpaci-Dusseau, Arpaci-Dusseau, “Gathering at the Well: Creating Communities for Grid I/O”, Supercomputing 2001.