39
Peter F. Couvares (based on material from Tevfik Kosar, Nick LeRoy, and Jeff Weber) Associate Researcher, Condor Team Computer Sciences Department University of Wisconsin-Madison <[email protected]> STORK & NeST: Making Data Placement a First Class Citizen in the Grid

Peter F. Couvares (based on material from Tevfik Kosar, Nick LeRoy, and Jeff Weber) Associate Researcher, Condor Team Computer Sciences Department University

Embed Size (px)

Citation preview

Peter F. Couvares(based on material from Tevfik Kosar, Nick LeRoy, and Jeff Weber)

Associate Researcher, Condor TeamComputer Sciences DepartmentUniversity of Wisconsin-Madison<[email protected]>

STORK & NeST: Making Data Placement a First Class Citizen in the Grid

Reliable and Efficient Grid Data Placement 2www.cs.wisc.edu/condor

Need to move data around..

Reliable and Efficient Grid Data Placement 3www.cs.wisc.edu/condor

While doing this..› Locate the data

› Access heterogeneous resources

› Recover form all kinds of failures

› Allocate and de-allocate storage

› Move the data

› Clean-up everything

All of these need to be done reliably and efficiently!

Reliable and Efficient Grid Data Placement 4www.cs.wisc.edu/condor

Stork

A scheduler for data placement activities in the GridWhat Condor is for computational jobs, Stork is for data placement Stork’s fundamental concept:“Make data placement a first class citizen in

the Grid.”

Reliable and Efficient Grid Data Placement 5www.cs.wisc.edu/condor

Outline

IntroductionThe ConceptStork FeaturesBig PictureConclusions

Reliable and Efficient Grid Data Placement 6www.cs.wisc.edu/condor

The Concept

• Stage-in

• Execute the Job

• Stage-out

Individual Jobs

Reliable and Efficient Grid Data Placement 7www.cs.wisc.edu/condor

The Concept

• Stage-in

• Execute the Job

• Stage-out

Stage-in

Execute the job

Stage-outRelease input space

Release output space

Allocate space for input & output data

Individual Jobs

Reliable and Efficient Grid Data Placement 8www.cs.wisc.edu/condor

The Concept

• Stage-in

• Execute the Job

• Stage-out

Stage-in

Execute the job

Stage-outRelease input space

Release output space

Allocate space for input & output data

Data Placement Jobs

Computational Jobs

Reliable and Efficient Grid Data Placement 9www.cs.wisc.edu/condor

DAGMan

The Concept

CondorJob

QueueData A A.submitData B B.submitJob C C.submit…..Parent A child BParent B child CParent C child D, E…..

C

StorkJob

Queue

E

DAG specification

A CBD

E

F

Reliable and Efficient Grid Data Placement 10www.cs.wisc.edu/condor

Why Stork?

Stork understands the characteristics and semantics of data placement jobs.Can make smart scheduling decisions, for reliable and efficient data placement.

Reliable and Efficient Grid Data Placement 11www.cs.wisc.edu/condor

Understanding Job Characteristics &

SemanticsJob_type = transfer, reserve, release?Source and destination hosts, files, protocols to use?• Determine concurrency level • Can select alternate protocols• Can select alternate routes• Can tune network parameters (tcp buffer size, I/O

block size, # of parallel streams)• …

Reliable and Efficient Grid Data Placement 12www.cs.wisc.edu/condor

Support for Heterogeneity

Protocol translation using Stork memory buffer.

Reliable and Efficient Grid Data Placement 13www.cs.wisc.edu/condor

Support for Heterogeneity

Protocol translation using Stork Disk Cache.

Reliable and Efficient Grid Data Placement 14www.cs.wisc.edu/condor

Flexible Job Representation

[ Type = “Transfer”; Src_Url = “srb://ghidorac.sdsc.edu/kosart.condor/x.dat”; Dest_Url = “nest://turkey.cs.wisc.edu/kosart/x.dat”;

…………

]

Reliable and Efficient Grid Data Placement 15www.cs.wisc.edu/condor

Failure Recovery and Efficient Resource

UtilizationFault tolerance• Just submit a bunch of data placement jobs, and then

go away..

Control number of concurrent transfers from/to any storage system• Prevents overloading

Space allocation and De-allocations• Make sure space is available

Reliable and Efficient Grid Data Placement 16www.cs.wisc.edu/condor

Outline

IntroductionThe ConceptStork FeaturesBig PictureConclusions

JOB DESCRIPTION

S

USERPLANNE

R Abstract DAG

JOB DESCRIPTION

S

PLANNER

USER

WORKFLOW MANAGER

Abstract DAG

Concrete DAG

RLS

JOB DESCRIPTION

S

PLANNER

DATA PLACEMENT SCHEDULER

COMPUTATION SCHEDULER

USER

GRID STORAGE SYSTEMS

WORKFLOW MANAGER

Abstract DAG

Concrete DAG

RLS / MCAT

GRID COMPUTE NODES

JOB DESCRIPTION

S

PLANNER

DATA PLACEMENT SCHEDULER

COMPUTATION SCHEDULER

USER

GRID STORAGE SYSTEMS

WORKFLOW MANAGER

Abstract DAG

Concrete DAG

RLS / MCAT

GRID COMPUTE NODES

D. JOB LOG FILES

C. JOB LOG FILES

POLICY ENFORCER

JOB DESCRIPTION

S

PLANNER

DATA PLACEMENT SCHEDULER

COMPUTATION SCHEDULER

USER

GRID STORAGE SYSTEMS

WORKFLOW MANAGER

Abstract DAG

Concrete DAG

RLS / MCAT

GRID COMPUTE NODES

D. JOB LOG FILES

C. JOB LOG FILES

POLICY ENFORCER

DATA MINER

NETWORK MONITORING TOOLS

FEEDBACK MECHANISM

JOB DESCRIPTION

S

PLANNER

STORKCONDOR/ CONDOR-G

USER

GRID STORAGE SYSTEMS

DAGMAN

Abstract DAG

Concrete DAG

RLS / MCAT

GRID COMPUTE NODES

D. JOB LOG FILES

C. JOB LOG FILES

MATCHMAKER

DATA MINER

NETWORK MONITORING TOOLS

FEEDBACK MECHANISM

Reliable and Efficient Grid Data Placement 23www.cs.wisc.edu/condor

ConclusionsRegard data placement as individual jobs.Treat computational and data placement jobs differently.Introduce a specialized scheduler for data placement.Provide end-to-end automation, fault tolerance, run-time adaptation, multilevel policy support, reliable and efficient transfers.

Reliable and Efficient Grid Data Placement 24www.cs.wisc.edu/condor

Future work

Enhanced interaction between Stork and higher level planners• better coordination of CPU and I/O

Interaction between multiple Stork servers and job delegation from one to anotherEnhanced authentication mechanismsMore run-time adaptation

Reliable and Efficient Grid Data Placement 25www.cs.wisc.edu/condor

Related PublicationsTevfik Kosar and Miron Livny. “Stork: Making Data Placement a First Class Citizen in the Grid”. In Proceedings of 24th IEEE Int. Conference on Distributed Computing Systems (ICDCS 2004), Tokyo, Japan, March 2004.George Kola, Tevfik Kosar and Miron Livny. “A Fully Automated Fault-tolerant System for Distributed Video Processing and Off-site Replication. To appear in Proceedings of 14th ACM Int. Workshop on etwork and Operating Systems Support for Digital Audio and Video (Nossdav 2004), Kinsale, Ireland, June 2004.Tevfik Kosar, George Kola and Miron Livny. “A Framework for Self-optimizing, Fault-tolerant, High Performance Bulk Data Transfers in a Heterogeneous Grid Environment”. In Proceedings of 2nd Int. Symposium on Parallel and Distributed Computing (ISPDC 2003), Ljubljana, Slovenia, October 2003.George Kola, Tevfik Kosar and Miron Livny. “Run-time Adaptation of Grid Data Placement Jobs”. In Proceedings of Int. Workshop on Adaptive Grid Middleware (AGridM 2003), New Orleans, LA, September 2003.

Reliable and Efficient Grid Data Placement 26www.cs.wisc.edu/condor

You don’t have to FedEx your data

anymore.. Stork delivers it for

you!For more information:

• Email: [email protected]• http://www.cs.wisc.edu/condor/stork

Reliable and Efficient Grid Data Placement 27www.cs.wisc.edu/condor

NeST

NeST (Network Storage Technology)A lightweight, portable storage manager for data placement activities on the GridAllocation: NeST negotiates “mini storage contracts” between users and server.Multi-protocol: Supports Chirp, GridFTP, NFS, HTTP

Chirp is NeST’s internal protocol.

Secure: GSI authenticationLightweight: Configuration and installation can be performed in minutes, and does not require root.

www.cs.wisc.edu/condor

Why storage allocations ?

› Users need both temporary storage, and long-term guaranteed storage.

› Administrators need a storage solution with configurable limits and policy.

› Administrators will benefit from NeST’s automatic reclamations of expired storage allocations.

www.cs.wisc.edu/condor

Storage allocations in NeST

› Lot – abstraction for storage allocation with an associated handle• Handle is used for all subsequent operations

on this lot

› Client requests lot of a specified size and duration. Server accepts or rejects client request.

www.cs.wisc.edu/condor

Lot types› User / Group

• User: single user (user controls ACL)• Group: shared use (all members control ACL)

› Best effort / Guaranteed• Best effort: server may purge data if necessary. Good

fit for derived data.• Guaranteed: server honors request duration.

› Hierarchical: Lots with lots (“sublots”)

www.cs.wisc.edu/condor

Lot operations

› Create, Delete, Update› MoveFile

• Moves files between lots

› AddUser, RemoveUser• Lot level authorization• List of users allowed to request sub-lots

› Attach / Detach• Performs NeST lot to path binding

www.cs.wisc.edu/condor

Functionality: GT4 GridFTP

GridFTP Server

Disk Module

Disk

Storage

globus-url-copy

Sample Application

(GSI-FTP)

www.cs.wisc.edu/condor

NeST Server

Functionality: GridFTP +NeST

GridFTP Server

NeST Module

Disk

Storage

NeST Client

Chirp

Handler

globus-url-copy

(Lot operations, etc.)(File transfers)

(chirp)(GSI-FTP)

(File transfer)(chirp)

www.cs.wisc.edu/condor

NeST Server

NeST with Stork

GridFTP Server

NeST Module

Stork

(Lot operations, etc.)(chirp)

(File transfers)(GSI-FTP)

(File transfer)(chirp)

GT4 NeST

Disk

Storage

Chirp

Handler

www.cs.wisc.edu/condor

Stork

NeST Sample Work DAG

NeST

Allocate

Job

Job

Job

Xfer In Xfer Out Release

Stork

Condor-G

www.cs.wisc.edu/condor

Connection Manager

› Used to control connections to NeST

› Allows connection reservations:• Reserve # of simultaneous connections• Reserve total bytes of transfer• Reservations have durations & expire• Reservations are persistent

www.cs.wisc.edu/condor

Release Status

› http://www.cs.wisc.edu/condor/nest› v0.9.7 expected soon

• v0.9.7 Pre 2 released• Just bug fixes; features frozen

› v1.0 expected later this year› Currently supports Linux, will support other

O/S’s in the future.

www.cs.wisc.edu/condor

Roadmap

› Performance tests with Stork

› Continue hardening code base

› Expand supported platforms• Solaris & other UNIX-en

› Bundle with Condor

› Connection Manager

www.cs.wisc.edu/condor

Questions ?

› More information available at http://www.cs.wisc.edu/condor/nest