INFSO-RI-508833 Enabling Grids for E-sciencE www.eu-egee.org Data Management System Jean Salzemann CNRS/IN2P3 ACGRID School, Hanoi (Vietnam) November 6th, 2007 Credits: Giuseppe Misurelli

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

Data Management System

Jean Salzemann CNRS/IN2P3

ACGRID School, Hanoi (Vietnam) November 6th, 2007

Credits: Giuseppe Misurelli

Outline

• Grid Data Management Challenge

• Storage Elements and SRM

• LFC File Catalog

• Data Movement Utils

Grid DM Challenge

The Grid DM Challenge /1

Need: Heterogeneous. Data are stored on different storage systems using different technologies.
Requirement: A common interface to storage resources is required in order to hide the underlying complexity.
Solution: Storage Resource Manager (SRM) interface; (gLite File I/O Server)

Need: Distributed. Data are stored in different locations; in most cases there is no shared file system or common namespace.
Requirements: Data need to be moved between different locations. Need to keep track of where data are stored.
Solutions: File Transfer Service (FTS), to move files among Grid sites. Catalog, to keep track of where data are stored.

Need: Data retrieving. Applications are located in different places from where data are stored.
Requirement: A scheduled, reliable file transfer service is needed.
Solution: File Transfer Service
• Data Scheduler
• File Placement Service
• Transfer Agent
• File Transfer Library

Need: Security. Data must be managed according to the VO membership access control policy.
Requirement: Centralized access control service.
Solution: File Authorization Service

The Grid DM Challenge /2

• DM works with files, for the following reasons:
– the semantics of a file are well understood by everyone
– a file is the smallest granularity of data.

Data Management Services

• Storage Element – common interface to storage
– Storage Resource Manager: Castor, dCache, DPM, …
– POSIX-I/O: gLite-I/O
– Native access protocols: rfio, dcap
– Transfer protocols: gsiftp

• Catalogs – keep track of where data are stored
– File Catalog, Replica Catalog: LFC
– File Authorization Service
– Metadata Catalog: e.g. AMGA

• File Transfer – schedules reliable file transfer
– Data Scheduler
– File Transfer Service: lcg-utils, gLite FTS

SE and SRM

SRM in an example /1

She is running a job which needs:
– Data for physics event reconstruction
– Simulated data
– Some data analysis files
She will write files remotely too.

They are at CERN, in dCache.
They are at Fermilab, in a disk array.
They are at Nikhef, in a classic SE.

SRM in an example /2

dCache: own system, own protocols and parameters
Castor: no connection with dCache or DPM
gLite DPM: independent system from dCache or Castor

You as a user need to know all the systems!!!

SRM: I talk to them on your behalf. I will even allocate space for your files. And I will use transfer protocols to send your files there.

Storage Resource Management

• Data are stored on disk pool servers or Mass Storage Systems

• Storage resource management needs to take into account
– Transparent access to files (migration to/from disk pool)
– File pinning
– Space reservation
– File status notification
– Lifetime management

• The SRM (Storage Resource Manager) takes care of all these details
– The SRM is a single interface that takes care of local storage interaction and provides a Grid interface to the outside world
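To make space reservation, file pinning and lifetime management concrete, here is a minimal Python sketch. It is a toy model only: the class `ToyStorageManager` and its methods are hypothetical names invented for illustration and are not the SRM interface or any gLite API.

```python
class ToyStorageManager:
    """Hypothetical in-memory model of three storage-management ideas:
    space reservation, file pinning and pin lifetime. Not the SRM API."""

    def __init__(self, capacity_bytes):
        self.free_bytes = capacity_bytes
        self.reservations = {}   # reservation token -> reserved bytes
        self.pins = {}           # SURL -> time (in seconds) at which the pin expires
        self._counter = 0

    def reserve_space(self, size_bytes):
        """Set aside space before writing; fail if the pool is too full."""
        if size_bytes > self.free_bytes:
            raise RuntimeError("not enough free space")
        self.free_bytes -= size_bytes
        self._counter += 1
        token = f"space-{self._counter}"
        self.reservations[token] = size_bytes
        return token

    def pin(self, surl, lifetime_s, now_s):
        """Keep a file on disk (not migrated away) until now_s + lifetime_s."""
        self.pins[surl] = now_s + lifetime_s

    def is_pinned(self, surl, now_s):
        """A pin is only valid while its lifetime has not expired."""
        return self.pins.get(surl, 0) > now_s
```

The clock is passed in explicitly (`now_s`) so that the expiry of a pin can be followed by hand; a real storage system would use its own clock and would also release reservations.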

gLite SE types /1

• gLite 3.0 data access protocols:
– File Transfer: GSIFTP (GridFTP)
– File I/O (Remote File access): gsidcap, insecure RFIO, secured RFIO (gsirfio)

• Classic SE:
– GridFTP server
– Insecure RFIO daemon (rfiod): file access limited to the LAN
– Single disk or disk array
– No quota management
– Does not support the SRM interface

gLite SE types /2

• Mass Storage Systems (Castor)
– Files migrated between front-end disk and back-end tape storage hierarchies
– GridFTP server
– Insecure RFIO (Castor)
– Provide an SRM interface with all the benefits

• Disk pool managers (dCache and gLite DPM)
– Manage distributed storage servers in a centralized way
– Physical disks or arrays are combined into a common (virtual) file system
– Disks can be dynamically added to the pool
– GridFTP server
– Secure remote access protocols (gsidcap for dCache, gsirfio for DPM)
– SRM interface

File Catalog and DM Tools

Files & replicas: Naming Conventions

• Logical File Name (LFN)
– An alias created by a user to refer to some item of data, e.g. “lfn:cms/20030203/run2/track1”

• Globally Unique Identifier (GUID)
– A non-human-readable unique identifier for an item of data, e.g. “guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6”

• Site URL (SURL) (or Physical File Name (PFN) or Site FN)
– The location of an actual piece of data on a storage system, e.g.
“srm://pcrd24.cern.ch/flatfiles/cms/output10_1” (SRM)
“sfn://lxshare0209.cern.ch/data/alice/ntuples.dat” (Classic SE)

• Transport URL (TURL)
– Temporary locator of a replica plus access protocol, understood by an SE, e.g. “rfio://lxshare0209.cern.ch//data/alice/ntuples.dat”
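The four kinds of name can be told apart by their prefix. The following Python sketch shows one way to do that; the function `classify_name` is a hypothetical helper, and the list of TURL schemes is an assumption drawn from the access protocols named elsewhere in these slides.

```python
# Assumption: SURL and TURL schemes are taken from the examples and protocol
# lists in these slides; a real deployment may accept others.
SURL_SCHEMES = {"srm", "sfn"}
TURL_SCHEMES = {"rfio", "gsirfio", "dcap", "gsidcap", "gsiftp"}

def classify_name(name):
    """Return 'LFN', 'GUID', 'SURL', 'TURL' or 'unknown' for a grid file name."""
    if name.startswith("lfn:"):
        return "LFN"
    if name.startswith("guid:"):
        return "GUID"
    scheme, sep, _rest = name.partition("://")
    if sep and scheme in SURL_SCHEMES:
        return "SURL"
    if sep and scheme in TURL_SCHEMES:
        return "TURL"
    return "unknown"

if __name__ == "__main__":
    for example in [
        "lfn:cms/20030203/run2/track1",
        "guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
        "srm://pcrd24.cern.ch/flatfiles/cms/output10_1",
        "sfn://lxshare0209.cern.ch/data/alice/ntuples.dat",
        "rfio://lxshare0209.cern.ch//data/alice/ntuples.dat",
    ]:
        print(classify_name(example), example)
```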

LFC - Description

• Provides
– Bulk operations
– Cursors for large queries
– Timeouts and retries for client operations

• Features
– User exposed transaction API
– Hierarchical namespace and namespace operations
– Integrated GSI Authentication and Authorization
– Access Control Lists (Unix Permissions and POSIX ACLs)
– Checksums

Supported database backends: Oracle and MySQL

LFC - Architecture

• LFC stores both logical and physical mappings for the file in the same database, which speeds up operations
• Treats all entities as files in a UNIX-like filesystem
• File API also similar to UNIX (create(), mkdir(), chown(), …)
• Hierarchical namespace of LFNs mapped to the GUIDs
• GUIDs mapped to the physical locations of file replicas in the storage
• System attributes of files (creation time, file size, checksum, …) stored as LFN attributes
• One field for user-defined metadata
• Multiple LFNs per GUID allowed as symbolic links to the primary LFN

[Diagram: catalog entities]
File Metadata: Logical File Name (LFN), GUID, System Metadata (ACLs, ownership, etc.)
Symlinks: Link name
User Metadata: User-defined metadata
File Replica: Storage File Name, Storage Host
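The mapping chain described above (LFN to GUID, GUID to replicas, extra LFNs as symbolic links) can be sketched as a small in-memory model. This is an illustration only: `ToyCatalog` and `FileEntry` are hypothetical names, the host names are made up, and nothing here reflects the real LFC schema or API.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class FileEntry:
    guid: str
    primary_lfn: str
    symlinks: list = field(default_factory=list)   # additional LFNs for the same GUID
    replicas: list = field(default_factory=list)   # SURLs of the physical copies

class ToyCatalog:
    """Hypothetical single-store catalog: logical and physical mappings together."""

    def __init__(self):
        self._entries = {}       # GUID -> FileEntry
        self._lfn_to_guid = {}   # every LFN (primary or symlink) -> GUID

    def register(self, lfn, surl, guid=None):
        """Create a new entry; a GUID is generated if none is supplied."""
        if lfn in self._lfn_to_guid:
            raise ValueError(f"LFN already registered: {lfn}")
        guid = guid or str(uuid.uuid4())
        self._entries[guid] = FileEntry(guid, lfn, replicas=[surl])
        self._lfn_to_guid[lfn] = guid
        return guid

    def guid_of(self, lfn):
        return self._lfn_to_guid[lfn]

    def add_alias(self, lfn, alias):
        """Add a second LFN that points at the same GUID (a symbolic link)."""
        if alias in self._lfn_to_guid:
            raise ValueError(f"LFN already registered: {alias}")
        guid = self.guid_of(lfn)
        self._entries[guid].symlinks.append(alias)
        self._lfn_to_guid[alias] = guid

    def add_replica(self, lfn, surl):
        self._entries[self.guid_of(lfn)].replicas.append(surl)

    def list_replicas(self, lfn):
        return list(self._entries[self.guid_of(lfn)].replicas)
```

Because aliases resolve to the same GUID, listing replicas through a symlink returns the same SURLs as listing them through the primary LFN.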

File Catalog and DM Tools

GFAL: Grid File Access Library

Interactions with SE require some components:
→ File catalog services to locate replicas
→ SRM
→ File access mechanism to access files from the SE on the WN

GFAL does all these tasks for you:
→ Hides all these operations
→ Presents a POSIX interface for the I/O operations
→ User can create all commands needed for storage management
→ It also offers an interface to SRM

Supported protocols:
→ file (local or nfs-like access)
→ dcap, gsidcap and kdcap (dCache access)
→ rfio (Castor access) and gsirfio (DPM)

lcg-utils DM tools

• High level interface (CL tools and APIs) to
– Upload/download files to/from the Grid (UI, CE and WN <---> SEs)
– Replicate data between SEs and locate the best replica available
– Interact with the file catalog

• Definition: A file is considered to be a Grid File if it is both physically present in a SE and registered in the File Catalog
– lfc commands to interact with file catalog features
– lcg-utils commands ensure the consistency between files in the Storage Elements and entries in the File Catalog
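The Grid File definition amounts to a set intersection: a file counts only if it appears both on a Storage Element and in the catalog. A minimal Python sketch of that check follows; `check_consistency` is a hypothetical helper, and the two input lists are assumed to have been collected by other means.

```python
def check_consistency(stored_surls, registered_surls):
    """Compare what is physically on the SEs with what the catalog lists.

    Returns three sorted lists:
      grid_files   - present on an SE and registered (proper Grid files)
      unregistered - present on an SE but unknown to the catalog
      dangling     - registered in the catalog but missing from the SEs
    """
    stored = set(stored_surls)
    registered = set(registered_surls)
    return {
        "grid_files": sorted(stored & registered),
        "unregistered": sorted(stored - registered),
        "dangling": sorted(registered - stored),
    }

if __name__ == "__main__":
    # Host names below are made up for the example.
    on_disk = ["srm://se1.example.org/a", "srm://se1.example.org/b"]
    in_catalog = ["srm://se1.example.org/b", "srm://se1.example.org/c"]
    print(check_consistency(on_disk, in_catalog))
```

The two non-empty "problem" lists are exactly the inconsistencies that the high level lcg-utils commands are meant to prevent.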

LFC commands

lfc-chmod Change access mode of the LFC file/directory

lfc-chown Change owner and group of the LFC file/directory

lfc-delcomment Delete the comment associated with the file/directory

lfc-getacl Get file/directory access control lists

lfc-ln Make a symbolic link to a file/directory

lfc-ls List file/directory entries in a directory

lfc-mkdir Create a directory

lfc-rename Rename a file/directory

lfc-rm Remove a file/directory

lfc-setacl Set file/directory access control lists

lfc-setcomment Add/replace a comment

LFC Catalog commands

lfc-ls

Listing the entries of a LFC directory:
lfc-ls [-cdiLlRTu] [--class] [--comment] [--deleted] [--display_side] [--ds] path…

where path specifies the LFN pathname (mandatory)

– Remember that LFC has a directory tree structure: /grid/<VO_name>/<you create it>
(/grid/<VO_name> is the LFC namespace; <you create it> is defined by the user)

All members of a VO have read-write permissions for their own directory

– You can set LFC_HOME to use relative paths:
> lfc-ls /grid/gilda/misurelli
> export LFC_HOME=/grid/gilda
> lfc-ls -l misurelli
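The effect of LFC_HOME on relative paths can be illustrated with a short Python sketch. `resolve_lfn` is a hypothetical helper that only mimics the behaviour described on this slide; it is not how the LFC client is implemented.

```python
import posixpath

def resolve_lfn(path, lfc_home=None):
    """Turn a possibly relative LFC path into an absolute one.

    Absolute paths are returned unchanged; relative paths are joined
    onto lfc_home, which stands in for the LFC_HOME environment variable.
    """
    if path.startswith("/"):
        return path
    if not lfc_home:
        raise ValueError("relative path given but LFC_HOME is not set")
    return posixpath.join(lfc_home, path)

if __name__ == "__main__":
    print(resolve_lfn("/grid/gilda/misurelli"))
    print(resolve_lfn("misurelli", lfc_home="/grid/gilda"))
```

With LFC_HOME set to /grid/gilda, both calls resolve to the same directory, which is why the two lfc-ls invocations on the slide list the same entries.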

lfc-mkdir

Creating directories in the LFC:
lfc-mkdir [-m mode] [-p] path...

• Where path specifies the LFC pathname

• Remember that while registering a new file (using lcg-cr, for example) the corresponding destination directory must be created in the catalog beforehand:

– lfc-mkdir /grid/gilda/misurelli/practise

– lfc-ls -l /grid/gilda/misurelli

lcg-utils commands

Replica Management

lcg-cp Copies a grid file to a local destination

lcg-cr Copies a file to a SE and registers the file in the catalog

lcg-del Delete one file

lcg-rep Replication between SEs and registration of the replica

lcg-gt Gets the TURL for a given SURL and transfer protocol

lcg-sd Sets file status to “Done” for a given SURL in a SRM request

File Catalog Interaction

lcg-aa Add an alias in LFC for a given GUID

lcg-ra Remove an alias in LFC for a given GUID

lcg-rf Registers in LFC a file placed in a SE

lcg-uf Unregisters in LFC a file placed in a SE

lcg-la Lists the aliases for a given SURL, GUID or LFN

lcg-lg Get the GUID for a given LFN or SURL

lcg-lr Lists the replicas for a given GUID, SURL or LFN

lcg-utils: lcg-cr

• Upload a file to a SE and register it into the catalog

lcg-cr -d dest_file | dest_host [-g guid] [-l lfn] [-v | --verbose] --vo vo src_file

where:
– dest_host is the fully qualified hostname of the destination SE
– dest_file is a valid SURL (both sfn:// and srm:// formats are valid)
– guid specifies the Grid Unique IDentifier. If this option is not present, a GUID is generated internally
– lfn specifies the Logical File Name associated with the file
– vo specifies the Virtual Organization the user belongs to
– src_file specifies the source file name: the protocol can be file:/// or gsiftp:///
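As an illustration of the synopsis above, the Python sketch below assembles an lcg-cr argument list without running it. The options used (-d, -g, -l, -v, --vo) are the ones listed on this slide; the helper name `build_lcg_cr`, the SE host name and the file paths are made up for the example.

```python
def build_lcg_cr(vo, src_file, dest, lfn=None, guid=None, verbose=False):
    """Build the argv list for an lcg-cr call following the slide's synopsis.

    dest is either the destination SE host name or a full SURL.
    Nothing is executed here; the list could be handed to subprocess.run.
    """
    argv = ["lcg-cr", "--vo", vo, "-d", dest]
    if guid:
        argv += ["-g", guid]
    if lfn:
        argv += ["-l", lfn]
    if verbose:
        argv.append("-v")
    argv.append(src_file)   # source file comes last, as in the synopsis
    return argv

if __name__ == "__main__":
    # Hypothetical values: the SE host and local path do not refer to real resources.
    cmd = build_lcg_cr(
        vo="gilda",
        src_file="file:///home/user/test.txt",
        dest="se.example.org",
        lfn="lfn:/grid/gilda/misurelli/practise/test.txt",
        verbose=True,
    )
    print(" ".join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to a process launcher.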

Advanced utilities: gridftp commands

edg-gridftp-exists TURL Checks if file/dir exists on a SE

edg-gridftp-ls TURL Lists a directory on a SE

globus-url-copy srcTURL dstTURL Copies files between SEs

edg-gridftp-mkdir TURL Creates a directory on a SE

edg-gridftp-rename srcTURL dstTURL Renames a file on a SE

edg-gridftp-rm TURL Removes a file from a SE

edg-gridftp-rmdir TURL Removes a directory on a SE

Used for low level management of file/directories in SEs

Globus-url-copy

• globus-url-copy srcTURL destTURL
– low level file transfer

• Interaction with RLS components
– edg-lrc command (actions on LRC)
– edg-rmc command (actions on RMC)
– C++ and Java API for all catalog operations
http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-lrc-devguide.pdf
http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-rmc-devguide.pdf

• Using low level CLI and API is STRONGLY discouraged
– Risk: losing consistency between SEs and catalogues
– REMEMBER: a file is in Grid if it is BOTH:
stored in a Storage Element
registered in the file catalog

References

• gLite documentation homepage
– http://glite.web.cern.ch/glite/documentation/default.asp

• LFC and DPM documentation
– https://uimon.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation

Questions…