
Separating Abstractions from Resources in a Tactical Storage System Douglas Thain University of Notre Dame ccl


Separating Abstractions from Resources in a Tactical Storage System

Douglas Thain

University of Notre Dame

http://www.nd.edu/~ccl

Abstract

Users of distributed systems encounter many practical barriers between their jobs and the data they wish to access.

Problem: Users have access to many resources (disks), but are stuck with the abstractions (cluster NFS) provided by administrators.

Solution: Tactical Storage Systems allow any user to create, reconfigure, and tear down abstractions without bugging the administrator.

The Standard Model

[Figure: each cluster has a transparent distributed filesystem over a shared disk; other nodes have private disks. Data moves between them via FTP, SCP, RSYNC, HTTP, ...]

Problems with the Standard Model

Users encounter partitions in the WAN.
– Easy to access data inside cluster, hard outside.
– Must use different mechanisms on different links.
– Difficult to combine resources together.

Resources go unused.
– Disks on each node of a cluster.
– Unorganized resources in a department/lab.

Unnecessary cross-talk between users.
– User A demands async NFS for performance.
– User B demands sync NFS for consistency.

A global file system is not possible!

What if...

Users could easily access any storage?

I could borrow an unused disk for NFS?

An entire cluster can be used as storage?

Multiple clusters could be combined?

I could reconfigure structures without root?
– (Or bugging the administrator daily.)

Solution: Tactical Storage System (TSS)

Outline

Problems with the Standard Model
Tactical Storage Systems
– File Servers, Catalogs, Abstractions, Adapters

Applications:
– Remote Dynamic Linking in HEP Simulation
– Remote Database Access in HEP Simulation
– Expandable Filesystem for Experimental Data
– Expandable Database for Bioinformatics Simulation

Ongoing Work
– Malloc, Dynamic Views, DACLs, PINS

Final Thought

Tactical Storage Systems (TSS)

A TSS allows any node to serve as a file server or as a file system client.
All components can be deployed without special privileges – but with security.
Users can build up complex structures.
– Filesystems, databases, caches, ...

Two Independent Concepts:
– Resources – The raw storage to be used.
– Abstractions – The organization of storage.

[Figure: applications (App) reach a central filesystem, a distributed database abstraction, and a distributed filesystem abstraction through adapters. The abstractions are built on file servers running on UNIX machines, each with its own filesystem. Captions: "Cluster administrator controls policy on all storage in cluster." "Workstation owners control policy on each machine." One adapter is marked "???".]

Components of a TSS:

1 – File Servers

2 – Catalogs

3 – Abstractions

4 – Adapters

1 – File Servers

Unix-Like Interface
– open/close/read/write
– getfile/putfile to stream whole files
– opendir/stat/rename/unlink

Complete Independence
– choose friends
– limit bandwidth/space
– evict users?

Trivial to Deploy
– run server + setacl
– no privilege required
– can be thrown into a grid system

Flexible Access Control

[Figure: file server A and file server B, each exporting a filesystem over the Chirp protocol; one is controlled by the owner of server A, the other by the owner of server B.]

Access Control in File Servers

Unix Security is not Sufficient
– No global user database possible/desirable.
– Mapping external credentials to Unix gets messy.

Instead, Make External Names First-Class
– Perform access control on remote, not local, names.
– Types: Globus, Kerberos, Unix, Hostname, Address

Each directory has an ACL:
globus:/O=NotreDame/CN=DThain RWLA
kerberos:[email protected] RWL
hostname:*.cs.nd.edu RL
address:192.168.1.* RWLA
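As a rough illustration of how a per-directory ACL like the one above might be evaluated, the following Python sketch matches a client's external name against each entry using shell-style wildcards. The function names, the use of `fnmatchcase`, and the rule that rights from all matching entries are combined are assumptions made for this sketch, not a description of the actual file server. The letters are presumably read, write, list, and admin.

```python
from fnmatch import fnmatchcase

# ACL entries copied from the slide: (subject pattern, rights).
ACL = [
    ("globus:/O=NotreDame/CN=DThain", "RWLA"),
    ("hostname:*.cs.nd.edu", "RL"),
    ("address:192.168.1.*", "RWLA"),
]

def rights_for(subject, acl):
    """Union of rights from every entry whose pattern matches the subject.

    Combining rights across entries is an assumption of this sketch.
    """
    granted = set()
    for pattern, rights in acl:
        if fnmatchcase(subject, pattern):
            granted.update(rights)
    return granted

def allowed(subject, right, acl=ACL):
    """True if the subject holds the given single-letter right."""
    return right in rights_for(subject, acl)
```

For example, `allowed("hostname:laptop.cs.nd.edu", "W")` is false under this ACL, because hosts in `*.cs.nd.edu` only receive `RL`.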

Problem: Shared Namespace

[Figure: a file server whose directory has the ACL globus:/O=NotreDame/* RWLAX and holds a.out, test.c, test.dat, and cms.exe side by side.]

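Solution: Reservation (V) Right

[Figure: the file server's top-level directory has the ACL O=NotreDame/CN=* V(RWLA), which allows mkdir only. When /O=NotreDame/CN=Monk calls mkdir, the new directory gets the ACL /O=NotreDame/CN=Monk RWLA and holds Monk's a.out and test.c. When /O=NotreDame/CN=Ted calls mkdir, the new directory gets the ACL /O=NotreDame/CN=Ted RWLA and holds Ted's a.out and test.c.]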
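The reservation right can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Directory` class, the `mkdir` helper, and the parsing of `V(...)` are hypothetical, and the leading slash on the pattern `/O=NotreDame/CN=*` is added here so that it matches the subject names shown on the slide.

```python
from fnmatch import fnmatchcase

class Directory:
    """Hypothetical in-memory directory with an ACL and child directories."""
    def __init__(self, acl):
        self.acl = dict(acl)   # subject pattern -> rights string
        self.children = {}     # name -> Directory

def reserved_rights(directory, subject):
    """Return the rights inside V(...) for the first matching entry, else None."""
    for pattern, rights in directory.acl.items():
        if (fnmatchcase(subject, pattern)
                and rights.startswith("V(") and rights.endswith(")")):
            return rights[2:-1]
    return None

def mkdir(parent, name, subject):
    """Create a child directory whose ACL names only the caller."""
    granted = reserved_rights(parent, subject)
    if granted is None:
        raise PermissionError(f"{subject} may not mkdir here")
    if name in parent.children:
        raise FileExistsError(name)
    child = Directory({subject: granted})
    parent.children[name] = child
    return child
```

With a root of `Directory({"/O=NotreDame/CN=*": "V(RWLA)"})`, a call by `/O=NotreDame/CN=Monk` yields a directory whose ACL is exactly `{"/O=NotreDame/CN=Monk": "RWLA"}`, so Monk and Ted get separate namespaces without sharing rights on the parent.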

2 - Catalogs

[Figure: catalog servers receive periodic UDP updates and answer queries over HTTP in XML, TXT, or ClassAds form.]

3 - Abstractions

An abstraction is an organizational layer built on top of one or more file servers.

End Users choose what abstractions to employ.

Working Examples:
– CFS: Central File System
– DSFS: Distributed Shared File System
– DSDB: Distributed Shared Database

Others Possible?
– Distributed Backup System
– Striped File System (RAID/Zebra)

CFS: Central File System

[Figure: three applications (appl), each with its own adapter using CFS, access files on a single file server. Other labels in the figure: ptr.]

DSFS: Dist. Shared File System

[Figure: two applications (appl), each with an adapter using DSFS, and three file servers holding files. The adapter first looks up a file's location, then accesses the data.]
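The two steps in the DSFS figure (look up the file location, then access the data) can be sketched as follows. Everything here is an in-memory stand-in: the `FileServer` and `DSFS` classes, the round-robin placement, and the `/data/N` physical names are assumptions for illustration, not the real implementation.

```python
class FileServer:
    """Stands in for one file server: a flat map of path -> bytes."""
    def __init__(self, name):
        self.name = name
        self.files = {}

class DSFS:
    """Hypothetical distributed shared filesystem built from several servers."""
    def __init__(self, servers):
        self.servers = {s.name: s for s in servers}
        self.pointers = {}   # logical name -> (server name, physical path)
        self._next = 0       # counter used for placement and unique names

    def write(self, logical_name, data):
        # Round-robin placement is an arbitrary policy chosen for this sketch.
        names = sorted(self.servers)
        server = names[self._next % len(names)]
        physical = f"/data/{self._next}"
        self._next += 1
        self.servers[server].files[physical] = data
        self.pointers[logical_name] = (server, physical)

    def read(self, logical_name):
        server, physical = self.pointers[logical_name]   # step 1: look up location
        return self.servers[server].files[physical]      # step 2: access data
```

The design point this is meant to show is that the namespace (the pointers) is separate from the storage (the servers), so servers can be added without changing how applications name files.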

DSDB: Dist. Shared Database

[Figure: two applications (appl), each with an adapter using DSDB, a database server holding the file index, and file servers holding the files. Operations shown: create file, insert, query, direct access.]

4 - Adapter

[Figure: processes such as tcsh, cat, and vi run under the adapter, which traps their system calls via ptrace and keeps its own process table and file table.]

Like an OS Kernel
– Tracks procs, files, etc.
– Adds new capabilities.
– Enforces owner’s policies.

Delegated Syscalls
– Trapped via ptrace interface.
– Action taken by Parrot.
– Resources charged to Parrot.

User Chooses Abstraction
– Appears as a filesystem.
– Option: Timeout tolerance.
– Option: Consistency semantics.
– Option: Servers to use.
– Option: Auth mechanisms.

[Figure: the earlier architecture diagram repeated with the components named. Adapter: Parrot. Abstractions: CFS, DSFS, DSDB.]

Performance Summary

Nothing comes for free!
– System calls: order of magnitude slower.
– Memory bandwidth overhead: extra copies.
– TSS can drive network/switch to limits.

Compared to NFS Protocol:
– TSS slightly better on small operations. (no lookup)
– TSS much better in network bandwidth. (TCP)
– NFS caches, TSS doesn’t (today), mixed blessing.

On real applications:
– Measurable slowdown
– Benefit: far more flexible and scalable.

Outline

Problems with the Standard Model
Tactical Storage Systems
– File Servers, Catalogs, Abstractions, Adapters

Applications:
– Remote Dynamic Linking in HEP Simulation
– Remote Database Access in HEP Simulation
– Expandable Filesystem for Astrophysics Data
– Expandable Database for Mol. Dynamics Simulation

Ongoing Work
– Malloc, Dynamic Views, DACLs, PINS

Final Thoughts

Remote Dynamic Linking

[Figure: the application (appl) and its ld.so run under the adapter, whose FTP driver reaches an FTP server across the WAN using anonymous login. The server's filesystem holds liba.so, libb.so, and libc.so. The job selects several MB from 60 GB of libraries.]

Credit: Igor Sfiligoi @ Fermi National Lab

Modular Simulation Needs Many Libraries
– Devel. on workstations, then ported to grid.
– Selection of library depends on analysis tech.

Solution: Dynamic Link with TSS and FTP:
– LD_LIBRARY_PATH=/ftp/server.name/libs

Send adapter along with job.

Related Work

Lots of file services for the Grid:
– GridFTP, Freeldr, NeST, IBP, SRB, RFIO, ...
– Adapter interfaces with many of these!

Why have another file server?
– Reason 1: Must have precise Unix semantics!
  Apps distinguish ENOENT vs EACCES vs EISDIR.
  FTP always returns error 550, regardless of error.
– Reason 2: TSS focused on easy deployment.
  No privilege required, no config files, no rebuilding, flexible access control, ...
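To make Reason 1 concrete, the short Python sketch below shows how an application can tell two failures apart by errno on a POSIX system, and how that distinction disappears if every failure is reported with one code. The helper names are made up for this illustration, and `ftp_style_reply` simply encodes the slide's claim that FTP reports 550 for all of these cases.

```python
import errno
import os
import tempfile

def open_error(path):
    """Try to open path for reading; return the symbolic errno name, or None on success."""
    try:
        with open(path, "rb"):
            return None
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))

def ftp_style_reply(errno_name):
    """Per the slide: every failure collapses to the same reply code."""
    return 550

# Two different failures in a scratch directory (POSIX behavior assumed).
with tempfile.TemporaryDirectory() as d:
    missing = open_error(os.path.join(d, "no-such-file"))   # file does not exist
    is_dir = open_error(d)                                   # path is a directory
```

On Linux the two results differ (a missing file versus a directory), which is exactly the information a dynamic linker or database library relies on when probing paths; after mapping through `ftp_style_reply` they are indistinguishable.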

Remote Database Access

[Figure: a script runs sim.exe, linked with libdb.so, under the adapter using CFS. Across the WAN, a TSS file server exports a filesystem holding the DB data. Both ends use GSI authentication.]

HEP Simulation Needs Direct DB Access
– App linked against Objectivity DB.
– Objectivity accesses filesystem directly.
– How to distribute application securely?

Solution: Remote Root Mount via TSS:
parrot -M /=/chirp/fileserver/rootdir

DB code can read/write/lock files directly.

Credit: Sander Klous @ NIKHEF
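The remote root mount above amounts to rewriting every path the application opens before it reaches a driver. The Python sketch below illustrates one way such a rewrite could work, assuming absolute paths and a longest-prefix rule; the `remap` function and that rule are assumptions for illustration and do not describe Parrot's internals.

```python
def remap(path, mounts):
    """Rewrite an absolute path using the longest matching mount prefix.

    mounts maps a local prefix to a replacement prefix,
    e.g. {"/": "/chirp/fileserver/rootdir"}.
    """
    best_prefix, best_target = None, None
    for prefix, target in mounts.items():
        norm = prefix.rstrip("/") or "/"
        matches = norm == "/" or path == norm or path.startswith(norm + "/")
        if matches and (best_prefix is None or len(norm) > len(best_prefix)):
            best_prefix, best_target = norm, target
    if best_prefix is None:
        return path   # no mount applies; leave the path alone
    rest = path if best_prefix == "/" else path[len(best_prefix):]
    rest = rest.strip("/")
    base = best_target.rstrip("/")
    return (base + "/" + rest) if rest else (base or "/")
```

With the mount from the slide, `remap("/etc/db.conf", {"/": "/chirp/fileserver/rootdir"})` gives `/chirp/fileserver/rootdir/etc/db.conf`, so the unmodified database code sees what looks like a local root.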

Performance on EDG Testbed

Setup      Time to Init     Time/Event
Unix       446 +/- 46       64s
LAN/NFS    4464 +/- 172     113s
LAN/TSS    4505 +/- 155     113s
WAN/TSS    6275 +/- 330     88s

Expandable Filesystem for Experimental Data

[Figure: Project GRAND (http://www.nd.edu/~grand) writes 10 GB/day today (could be lots more!) to a buffer disk and then to daily tapes that make up a 25-year archive. The analysis code can only analyze the most recent data.]

Credit: John Poirer @ Notre Dame Astrophysics Dept.

Expandable Filesystem for Experimental Data

[Figure: the same Project GRAND pipeline (http://www.nd.edu/~grand; buffer disk, daily tapes, 25-year archive), now with four file servers combined into a Distributed Shared Filesystem that the analysis code reaches through an adapter. The analysis code can analyze all data over large time scales.]

Credit: John Poirer @ Notre Dame Astrophysics Dept.

Appl: Distributed MD Database

State of Molecular Dynamics Research:
– Easy to run lots of simulations!
– Difficult to understand the “big picture”
– Hard to systematically share results and ask questions.

Desired Questions and Activities:
– “What parameters have I explored?”
– “How can I share results with friends?”
– “Replicate these items five times for safety.”
– “Recompute everything that relied on this machine.”

GEMS: Grid Enabled Molecular Sims
– Distributed database for MD simulations at Notre Dame.
– XML database for indexing, TSS for storage/policy.

GEMS Distributed Database

[Figure: a database server, alongside catalog servers, maps XML metadata records to replicated data files, e.g. XML -> host1:fileA, host7:fileB, host3:fileC and XML -> host6:fileX, host2:fileY, host5:fileZ. Data is inserted together with its XML description; a query such as Temp>300K, Mol==CH4 returns locations such as host5:fileZ and host6:fileX.]

Credit: Jesus Izaguirre and Aaron Striegel, Notre Dame CSE Dept.

Active Recovery in GEMS

GEMS and Tactical Storage

Dynamic System Configuration
– Add/remove servers, discovered via catalog

Policy Control in File Servers
– Groups can Collaborate within Constraints
– Security Implemented within File Servers

Direct Access via Adapters
– Unmodified Simulations can use Database
– Alternate Web/Viz Interfaces for Users.

Outline

Problems with the Standard Model
Tactical Storage Systems
– File Servers, Catalogs, Abstractions, Adapters

Applications:
– Remote Dynamic Linking in HEP Simulation
– Remote Database Access in HEP Simulation
– Expandable Filesystem for Astrophysics Data
– Expandable Database for Mol. Dynamics Simulation

Ongoing Work
– Malloc, Dynamic Views, DACLs, PINS

Final Thoughts

Ongoing Work

Malloc() for the Filesystem
– Resource owners want to limit users. (quota)
– End users need space assurance. (alloc)
– Need per-user allocations, not just global limits.

Dynamic Data Views
– Convert from DB to FS and back again.

Distributed Access Control
– ACLs refer to group definitions elsewhere.
– What’s new? Fault tolerance / policy management.

Processing in Storage (PINS)
– Move computation to data.
– Needs new programming (scripting) model.

Malloc in the Filesystem

Paper: “Grid3: Principles and Practice”
– 90% of jobs would fail, most due to disk!

Users need to alloc disk like anything else.
– (Not accessible to user: quotas, loopback)
– Allocation integrated with directory tree:

[Figure: a directory tree annotated with allocations. Labels: scratch 100 GB; job2 80 GB; job1 10 GB; job3 20 GB; input; output; taska 40 GB; taskb 40 GB.]
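One way to picture allocations that follow the directory tree is a nested reservation, where each child allocation is carved out of its parent's unreserved space. The Python sketch below is hypothetical: the `Allocation` class and its rule are assumptions, and only the names and sizes for scratch, job2, taska, and taskb are taken from the figure labels.

```python
class Allocation:
    """Hypothetical space allocation attached to a directory."""
    def __init__(self, name, size_gb, parent=None):
        self.name = name
        self.size_gb = size_gb
        self.children = []
        if parent is not None:
            # A child may only reserve what the parent has not yet handed out.
            if size_gb > parent.unreserved():
                raise OSError(
                    f"no space: {name} wants {size_gb} GB, "
                    f"{parent.name} has {parent.unreserved()} GB unreserved"
                )
            parent.children.append(self)

    def unreserved(self):
        """Space in this allocation not yet reserved by child allocations."""
        return self.size_gb - sum(c.size_gb for c in self.children)
```

Reserving up front in this way would give a job the space assurance the slide asks for: a request that cannot be satisfied fails at allocation time rather than partway through a run.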

Dynamic Data Views

The same data can be perceived as either a file system or a database.

Example:
– DB: get files s.t. (T>300K) && (Mol==“CH4”)
– FS: then process using scripts and shell
– DB: associate derived files with original
– FS: export and tar files for others.
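The back-and-forth in the example above can be sketched in Python as a metadata query (database view), a directory-like listing of the result (filesystem view), and a step that records a derived file against its source. The record values, field names, and helper functions are invented for illustration; only the query condition and the `host:file` location style come from the slides.

```python
# Hypothetical metadata index: one record per stored file.
records = [
    {"T": 310, "Mol": "CH4", "location": "host5:fileZ"},
    {"T": 350, "Mol": "CH4", "location": "host6:fileX"},
    {"T": 280, "Mol": "CH4", "location": "host2:fileY"},
    {"T": 320, "Mol": "H2O", "location": "host1:fileA"},
]

def query(index, predicate):
    """DB view: return the locations of all records matching the predicate."""
    return [r["location"] for r in index if predicate(r)]

def as_directory(locations):
    """FS view: present a query result as a listing of file name -> location."""
    return {loc.split(":", 1)[1]: loc for loc in locations}

def add_derived(index, original_location, derived_location):
    """DB view again: record a derived file with a link back to its source."""
    source = next(r for r in index if r["location"] == original_location)
    index.append({**source, "location": derived_location,
                  "derived_from": original_location})
```

A typical round trip would be `as_directory(query(records, lambda r: r["T"] > 300 and r["Mol"] == "CH4"))`, followed by ordinary shell processing of the listed files and a call to `add_derived` for each output.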

Dynamic Data Views

[Figure: the database server maps XML records to files, e.g. XML -> host6:fileX, host2:fileY, host5:fileZ. The result of a query such as Temp>300K, Mol==CH4 is presented to the application (App) as a Distributed Filesystem Abstraction.]

Distributed Access Control Lists

Users are very comfortable with the ACL and group model.

Can it be adapted to a grid environment?
– Yes, can let an ACL refer to remote server.
– Challenges: failures, caching, sharing policy.

[Figure: a TSS client contacts file server A, whose access control list is:
hostname:*.nd.edu RL
group:serverB/presidents RWL
File server B holds the group “Presidents”:
/O=NotreDame/CN=Jenkins
/O=Purdue/CN=Jischke
/O=Indiana/CN=Herbert]
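A minimal Python sketch of the idea in the figure: an ACL entry on server A names a group that is defined on server B, so checking access involves fetching the member list from the other server. The `fetch_group` stand-in, the in-memory `GROUPS` table, and the fail-closed behavior when the group server is unreachable are all assumptions for this sketch; the slide lists failures and caching as open challenges.

```python
from fnmatch import fnmatchcase

# Group definitions held by each server: server name -> group name -> members.
GROUPS = {
    "serverB": {
        "presidents": [
            "/O=NotreDame/CN=Jenkins",
            "/O=Purdue/CN=Jischke",
            "/O=Indiana/CN=Herbert",
        ]
    }
}

# ACL on file server A, as shown in the figure.
ACL_ON_A = [
    ("hostname:*.nd.edu", "RL"),
    ("group:serverB/presidents", "RWL"),
]

def fetch_group(server, group):
    """Stand-in for a network request to the server that owns the group."""
    if server not in GROUPS:
        raise ConnectionError(server)
    return GROUPS[server][group]

def rights_for(identities, acl, fetch=fetch_group):
    """Combine rights from every ACL entry matching any of the client's identities."""
    granted = set()
    for subject, rights in acl:
        if subject.startswith("group:"):
            server, group = subject[len("group:"):].split("/", 1)
            try:
                members = fetch(server, group)
            except ConnectionError:
                continue   # fail closed: an unreachable group grants nothing
            if any(i in members for i in identities):
                granted.update(rights)
        elif any(fnmatchcase(i, subject) for i in identities):
            granted.update(rights)
    return granted
```

Failing closed keeps the policy safe when server B is down, at the cost of locking out legitimate members; caching the member list would trade that availability problem for a staleness problem.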

PINS: Processing in Storage

Observation:
– Traditional clusters separate CPU and storage into two distinct systems/problems.
– Distributed computing is always some direct combination of CPU and I/O needs.

Idea: PINS
– Cluster HW is already a tightly integrated complex of CPU and I/O. Make the SW reflect the HW.
– Key: Always compute in the same place that the data is located. Leave newly created data in place.

Processing in Storage

[Figure: a database server holds an XML index of data files stored on four file servers, S1, S2, S3, and S4. Data labels on the servers: A, B, A, C, X, D, C; also "(X 200)".]

1. Compute Y = F(X).
2. Dispatch F to S3.
3. Y is stored on S3.
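The three steps above can be sketched as a dispatcher that consults an index to find where the input lives, runs the function against that server's storage, and leaves the output there. This Python sketch is purely illustrative: the `dispatch` function, the dictionary-based storage, and the single-location index are assumptions (the figure shows some files replicated on several servers, which is not modeled here).

```python
def dispatch(storage, index, func, input_name, output_name):
    """Run func where input_name already lives; leave the output in place.

    storage: server name -> {file name -> data}
    index:   file name -> server name (one location per file in this sketch)
    """
    server = index[input_name]                     # find the server holding X
    result = func(storage[server][input_name])     # "ship" F to that server
    storage[server][output_name] = result          # Y is stored where it was made
    index[output_name] = server                    # record Y's location
    return server
```

For instance, with X held on S3, `dispatch(storage, index, sum, "X", "Y")` returns `"S3"` and leaves Y on S3 without moving X across the network.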

Outline

Problems with the Standard Model
Tactical Storage Systems
– File Servers, Catalogs, Abstractions, Adapters

Applications:
– Remote Dynamic Linking in HEP Simulation
– Remote Database Access in HEP Simulation
– Expandable Filesystem for Astrophysics Data
– Expandable Database for Mol. Dynamics Simulation

Ongoing Work
– Malloc, Dynamic Views, DACLs, PINS

Final Thoughts

Tactical Storage Systems

Separate Abstractions from Resources

Components:
– Servers, catalogs, abstractions, adapters.
– Completely user level.
– Performance acceptable for real applications.

Independent but Cooperating Components
– Owners of file servers set policy.
– Users must work within policies.
– Within policies, users are free to build.

Acknowledgments

Science Collaborators:
– Jesus Izaguirre
– Sander Klous
– Peter Kunzst
– Erwin Laure
– John Poirer
– Igor Sfiligoi
– Aaron Striegel

CSE Graduate Students:
– Paul Brenner
– James Fitzgerald
– Jeff Hemmes
– Paul Madrid
– Chris Moretti
– Phil Snowberger
– Justin Wozniak

For more information...

Cooperative Computing Lab
http://www.cse.nd.edu/~ccl

Cooperative Computing Tools
http://www.cctools.org

Douglas Thain
– [email protected]
– http://www.cse.nd.edu/~dthain

Extra Slides

Performance – System Calls

Performance - Applications

[Figure label: parrot only]

Performance – I/O Calls

Performance – Bandwidth

Performance – DSFS