Efficient Access to Many Small Files in a Grid Filesystem Douglas Thain and Christopher Moretti University of Notre Dame


Efficient Access to Many Small (and Big) Files in a Grid Filesystem


Abstract

Many grid data tools focus on transferring, storing, and managing large (GB-TB) files.

But many users need to manage, transfer, and process lots (1000s) of small (KB-MB) files.

We describe protocols and interfaces for manipulating many small files over wide area networks. (Doesn’t hurt large files, either.)

Implemented in the Chirp file system.

Performance:
– Best case: order of magnitude improvement.
– Worst case: no slower than before.

The Small File Problem

Who has lots of small files?

Anyone using a batch system.
– One file for submit, input, output, error, log...

Anyone using a large software package.
– Executables, libraries, config files...

Anyone using a filesystem like a database.
– Genomics, astronomy, physics...

Anyone who likes to write shell scripts.
– foreach host in list ssh $host > $host.output

Why is this a problem?

Users do the “sensible” thing:
– foreach file in (list) do transfer done

The “sensible” thing performs miserably, because every file pays for:
– New TCP Connection
– SSL Authentication
– Configuration Operations
– Slow Start Again

Result is KB/s on a GB/s link.
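A rough sketch of why the per-file loop collapses to KB/s: if every file pays a fixed number of round trips before any data moves, that setup time dominates for small files. The function name and all numbers below (10 KB files, 100 ms round trip, 10 setup round trips, 1 GB/s link) are assumptions for illustration, not measurements from the talk, and the model ignores TCP slow start, which would only make things worse.

```python
def effective_throughput(file_bytes, rtt_s, setup_rtts, link_bytes_per_s):
    """Bytes per second achieved when every file pays a fixed setup cost."""
    setup_time = setup_rtts * rtt_s              # connect, authenticate, configure
    transfer_time = file_bytes / link_bytes_per_s
    return file_bytes / (setup_time + transfer_time)


# Assumed values: 10 KB file, 100 ms RTT, 10 setup round trips, 1 GB/s link.
rate = effective_throughput(10_000, 0.100, 10, 1_000_000_000)
print(f"{rate / 1000:.1f} KB/s")
```

With these assumed values the setup takes one full second per file while the data itself needs about ten microseconds, so the link speed barely matters.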

Why not just use tar?

If you can, you should!

Sometimes you cannot:
– The system semantics demand multiple files.
– Packing and unpacking can be very slow.
– Not enough disk space to unpack.
– Different apps select different data subsets.
– Using an existing script or program.

Users don’t know or care that it’s a distributed system, so why should they change?

The Challenge:

How to design interfaces so that users get the expected performance and behavior?

Chirp and Parrot: A Grid Filesystem

Requirements for a Grid Filesystem

Transparent access to files in the same manner as a local Unix filesystem.
Non-privileged deployment at both client and server. (Root is not possible on the grid.)
User control over policies for naming, caching, consistency, and fault tolerance.
Flexible access controls for sharing.
Good performance on both small and large files.

Chirp/Parrot – A Grid Filesystem

[Diagram: an ordinary Unix program issues Unix system calls, which Parrot captures with a ptrace trap. Parrot connects to a Chirp server over a single TCP stream, and the Chirp server exports an ordinary Unix filesystem. Both Parrot and Chirp are labeled “No Privs Needed!”. Another label reads “Automatic Recovery”.]

Authentication:
Kerberos / Globus / Hostname / Unix

Authorization:
kerberos:[email protected] RWLDA
globus:/O=ND/CN=Joe RWLDA
hostname:*.nd.edu RL
group:server.nd.edu/team RWL

Protocol:
open / pread / pwrite / close
stat / mkdir / rmdir / unlink
getfile / putfile / movefile

Ordinary Unix Commands

> parrot tcsh
> ls /chirp
alpha.nd.edu
beta.nd.edu
...
> cd /chirp/alpha.nd.edu/mydir
> cp /tmp/bigdata .
> emacs mydata.txt

Parrot Specific Commands

> parrot tcsh
> parrot_whoami
globus:/O=ND/CN=Joe
> parrot_getacl /chirp/alpha.nd.edu/
kerberos:[email protected] RWLDA
globus:/O=ND/CN=Joe RWL
hostname:*.nd.edu RL

Chirp as Remote Filesystem

[Diagram: Grid Site A and Grid Site B. Many applications, each running under Parrot, access one Chirp server that exports a Unix filesystem. Grid middleware launches the App/Parrot pairs, which carry a certificate (Cert). Connections are secured by GSI.]

Chirp as Cluster Filesystem

[Diagram: Grid Site A and Grid Site B. Many applications, each running under Parrot, access several Chirp servers, each exporting its own Unix filesystem. Other labels: dirserver, auxdb.]

http://www.cse.nd.edu/~ccl/viz

Sample Applications

Image Processing for Biometrics
– Moretti et al, PCGRID 2007

Bioinformatics on EGEE
– Blanchet et al, Grid 2006

High Energy Physics on LCG
– Sfiligoi et al, CHEP 2005

Molecular Dynamics Repository
– Wozniak et al, HPDC 2005

Remote DB Access on EDG
– Klous et al, CCPE 2005

Protocols for Small Files

What About FTP?

FTP is a great data transfer system, but it was never designed to be a file system:
– New TCP stream per data transfer.
– New TCP stream for each directory list.
– Lots of connections can overwhelm net devices.
– Coarse errors: 550 for all file system errors.
– Semantic problems: e.g. empty directory.
– Unix access controls. (But see SecPAL.)
– Wildly varying implementations and support.

FTP Protocol Reminder

[Diagram: FTP client and FTP server joined by a control connection and a data connection. Messages shown: AUTH GSSAPI / MIC / MIC (twice), PORT, RETR, and the data transfer.]

Minimum of four round trips (plus auth overhead) to fetch a file, plus loss of the TCP window.

Common practice is a new control connection for every data transfer!

What About NFS?

NFS was designed for a local area network among (relatively) trusted hosts.
– Fine-grained file access very slow on WAN.
– Kernel support and root assistance needed to start server, mount client, change target.
– Unix UID for ownership, access control.
– Need to bind to privileged port, often filtered.
– Use of “file handles” to refer to files makes it very difficult to build a user-level server, plus lots of lookup operations over the WAN.

NFS Protocol Reminder

[Diagram: NFS client and NFS server exchanging lookup(00,a), lookup(10,b), lookup(20,c), ... followed by read 4KB, read 4KB, read 4KB, ...]

On a WAN, throughput is limited to 4 KB / latency:
10 ms = 400 KB/s
100 ms = 40 KB/s
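The two figures above follow from fetching one block per round trip with nothing in flight in between. A small sketch of that arithmetic (the function name is illustrative, and the model assumes no pipelining or read-ahead):

```python
def rpc_throughput_kb_per_s(block_kb, latency_ms):
    """Throughput when exactly one block is fetched per round trip."""
    return block_kb * 1000.0 / latency_ms


for latency_ms in (10, 100):
    print(latency_ms, "ms ->", rpc_throughput_kb_per_s(4, latency_ms), "KB/s")
```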

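Chirp Hybrid Protocol Overview

[Diagram: Chirp client and Chirp server on one connection. After auth globus (8 RTT), the client issues open, read, write, close, ... as individual requests. getfile(“mydata”) is answered with the size and data. putfile(“otherdata”,size) is followed by the data.]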
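The “size and data” reply is what lets whole files stream back to back on one connection: the receiver knows exactly how many bytes belong to each file. Below is a minimal sketch of that idea using an in-memory stream. The framing here (an ASCII size line followed by raw bytes) and the function names are assumptions for illustration, not the actual Chirp wire format.

```python
import io


def send_file(stream, data: bytes) -> None:
    """Sender side of a getfile-style reply: a size line, then the raw bytes."""
    stream.write(f"{len(data)}\n".encode("ascii"))
    stream.write(data)


def recv_file(stream) -> bytes:
    """Receiver side: read the size line, then exactly that many bytes."""
    size = int(stream.readline().decode("ascii"))
    data = stream.read(size)
    if len(data) != size:
        raise EOFError("stream ended before the whole file arrived")
    return data


# Two files sent back to back on one in-memory "connection".
conn = io.BytesIO()
send_file(conn, b"first file")
send_file(conn, b"second\nfile")
conn.seek(0)
print(recv_file(conn))
print(recv_file(conn))
```

Because the size comes first, file contents may contain any bytes, including newlines, without confusing the receiver.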

Protocol Comparison

FTP - Stream per File
– Latency = 4+ RTT for each file
– Throughput = TCP limit after slow start

NFS - Remote Procedure Call
– Latency = 1 RTT for each file
– Throughput = block size / latency

Chirp - Hybrid
– Latency = 1 RTT for each file
– Throughput = TCP limit in steady state
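The comparison can be turned into a back-of-the-envelope cost model. This is a simplified sketch, not the authors' analysis: it treats the TCP limit as a fixed bandwidth, ignores slow start and authentication, and charges NFS one round trip per block. The function names and the workload (1000 files of 64 KB, 100 ms round trip, 100 MB/s) are assumptions for illustration.

```python
import math


def ftp_time(n_files, size, rtt, bandwidth):
    """Stream per file: four round trips of setup, then a streamed transfer."""
    return n_files * (4 * rtt + size / bandwidth)


def nfs_time(n_files, size, rtt, block=4096):
    """RPC per block: one round trip to find the file, one per block read."""
    return n_files * (rtt + math.ceil(size / block) * rtt)


def chirp_time(n_files, size, rtt, bandwidth):
    """Hybrid: one round trip per file, then a streamed transfer."""
    return n_files * (rtt + size / bandwidth)


# Assumed workload: 1000 files of 64 KB, 100 ms RTT, 100 MB/s path.
n, size, rtt, bw = 1000, 64 * 1024, 0.100, 100_000_000
for name, seconds in (
    ("FTP", ftp_time(n, size, rtt, bw)),
    ("NFS", nfs_time(n, size, rtt)),
    ("Chirp", chirp_time(n, size, rtt, bw)),
):
    print(f"{name}: {seconds:.0f} s")
```

Under this model the hybrid approach wins on both axes: it pays the per-file latency of an RPC protocol and the per-byte cost of a streaming protocol.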

[Figure: Local Area Performance]

[Figure: Wide Area Performance]

[Figure: Real WAN Performance]

Interfaces for Small Files

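Standard Unix Copy

cp /tmp/source /chirp/B/target

[Diagram: cp runs under Parrot, which has a Local driver backed by the local disk and a Chirp driver backed by a Chirp server. cp calls open(source) and open(target), then loops over read/write. Parrot forwards the open and read calls to the local disk and the open and write calls to the Chirp server.]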

Problem:
The system does not know the context of the operation!

Solution:
Introduce a higher-level operation, copyfile, that exploits the context.

Improved Copy with Copyfile

cp /tmp/source /chirp/B/target

[Diagram: newcp runs under Parrot and calls copyfile(source,target). Parrot's Local driver performs open(source) on the local disk, and its Chirp driver sends putfile(target) to the Chirp server.]

Is it reasonable to modify cp?

Installation:
– Cannot modify /bin/cp.
– Install new parrot_cp.
– Alias cp or link named “cp” in PATH.

Backwards compatibility:
– parrot_cp without Parrot falls back to normal.
– Ordinary cp on Parrot behaves as before.
– parrot_cp on a different filesystem falls back.
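The fallback behavior can be sketched as “try the whole-file operation, and if the environment does not support it, copy the ordinary way”. The `fast_copy` hook below is hypothetical and stands in for a copyfile-style operation. This is not how parrot_cp is implemented, only an illustration of the compatibility rule.

```python
import os
import shutil
import tempfile


def copy_with_fallback(source, target, fast_copy=None):
    """Try a whole-file copy operation first, then fall back to an ordinary copy.

    fast_copy is a hypothetical hook: it returns True when it handled the copy,
    and returns False or raises OSError when the operation is unavailable.
    """
    if fast_copy is not None:
        try:
            if fast_copy(source, target):
                return "fast"
        except OSError:
            pass  # operation not supported here, use the ordinary path
    shutil.copyfile(source, target)
    return "fallback"


def unsupported(source, target):
    """Stand-in for running without the whole-file operation available."""
    raise OSError("whole-file copy not supported on this filesystem")


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source")
    dst = os.path.join(tmp, "target")
    with open(src, "wb") as f:
        f.write(b"payload")
    print(copy_with_fallback(src, dst, fast_copy=unsupported))
```

The caller sees the same result either way, which is what makes it safe to put such a tool in front of the ordinary one in PATH.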

Improved Copy with Copyfile

cp /chirp/A/source /chirp/B/target

[Diagram: newcp runs under Parrot and calls copyfile(source,target). Parrot's Chirp driver sends thirdput(source,B,target) to Chirp server A, and server A sends putfile(target) directly to Chirp server B.]

Directory Copy

cp -r /chirp/A/mydir /chirp/B/mydir

[Diagram: Chirp server A holds mydir, containing an ACL and files X, Y, Z. cp runs under Parrot and issues mkdir(mydir) and setacl(mydir) to Chirp server B, then one thirdput per file: thirdput(/mydir/X,B,/mydir/X), and likewise for Y and Z. Server B ends up with mydir, its ACL, and X, Y, Z.]

Improved Directory Copy

cp -r /chirp/A/mydir /chirp/B/mydir

[Diagram: cp runs under Parrot and issues a single thirdput(/mydir,B,/mydir) to Chirp server A. Server A then performs mkdir, putfile*3, and setacl against Chirp server B, so that B holds mydir with its ACL and files X, Y, Z.]

[Figure: Third Party Performance]

You get the idea...

ls -la D
– Original: getdir D + N*stat
– Improved: getlongdir D

rm -rf D
– Original: getdir D + N*unlink (recursive)
– Improved: rmall D

md5sum F
– Original: open F + N*read + close
– Improved: md5 F
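Each improvement replaces 1 + N round trips with one. The sketch below counts round trips for the ls -la case against an in-memory stand-in for a server. The `FakeServer` class is hypothetical and only borrows the operation names from the list above. It is not the Chirp client API.

```python
class FakeServer:
    """In-memory stand-in for a file server that counts round trips."""

    def __init__(self, files):
        self.files = dict(files)      # name -> size in bytes
        self.round_trips = 0

    def getdir(self):
        self.round_trips += 1
        return list(self.files)

    def stat(self, name):
        self.round_trips += 1
        return self.files[name]

    def getlongdir(self):
        self.round_trips += 1
        return dict(self.files)       # names and metadata in one reply


def list_original(server):
    """getdir followed by one stat per entry: 1 + N round trips."""
    return {name: server.stat(name) for name in server.getdir()}


def list_improved(server):
    """A single getlongdir: 1 round trip."""
    return server.getlongdir()


files = {f"file{i}": i for i in range(1000)}
a, b = FakeServer(files), FakeServer(files)
assert list_original(a) == list_improved(b)
print(a.round_trips, "round trips vs", b.round_trips)
```

At a 100 ms round trip, the difference for a 1000-entry directory is roughly 100 seconds against a tenth of a second, before any data is read.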

Final Example

ls -la /chirp/alpha/data
md5sum /chirp/alpha/data/*
cp -r /chirp/alpha/data /chirp/beta/data
md5sum /chirp/beta/data/*
rm -rf /chirp/alpha/data

Original Implementation

[Diagram: app under parrot, with chirp server A and chirp server B. Operation labels shown: ls -la, md5, cp, rm, cp, md5.]

Improved Implementation

[Diagram: app under parrot, with chirp server A and chirp server B. Operation labels shown: ls -la, md5, cp, md5, rm.]

[Figure: Performance on Script. Bar chart of time (seconds), axis from 0 to 180, for the steps list, checksum, move, checksum, delete. Series: Original and Improved.]

The Challenge:

How to design interfaces so that users get the expected performance and behavior?

Summary

Good small file performance requires attention to low level network protocols.
– getfile, putfile, thirdput, rmall, checksum

Exploiting protocols requires minor changes to the Unix I/O interface.
– copyfile, rmall, checksum, others?

Easy to apply those changes in a user transparent way.
– cp, rm, md5sum all operate as normal

Usable performance in a wide-area FS.

For more information...

Douglas Thain
– [email protected]

Chris Moretti
– [email protected]

Parrot and Chirp
– http://www.cctools.org