Storage, Networks, Data Management Report on Parallel Session OSG Meet 8/2006 Frank Würthwein (UCSD)


Page 1

Storage, Networks, Data Management

Report on Parallel Session

OSG Meet 8/2006

Frank Würthwein (UCSD)

Page 2

Agenda & Focus

• 9 talks
  – ATLAS & CMS experience
  – Status of SRM and 3 of its implementations
  – UltraLight & LambdaStation
  – Forecasting

• 2 discussions
  – Performance Monitoring
  – Storage on OSG

Page 3

LHC Experience

• Large investment in 2006
  – 10-15 FTE @ T1
  – More than 2 PB of disk in SRM/dCache
• Roughly 50 TB to 500 TB across 18 installations
• Data management systems are VO-specific
• Data XFER commissioning:
  – 1-5 Gbps, or 1 PB per month (≈3 Gbps sustained), of IO per site is the state of the art via WAN (stream read/write)
  – 8-25 Gbps of IO per site is the state of the art via LAN (random access read)
• Concerns about dCache namespace scalability (a partitioning sketch follows below):
  – Stay below ~10k files per directory
  – Stay below XX Hz of file access rate. What’s XX?
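To make the ~10k-files-per-directory guidance concrete, here is a minimal Python sketch, purely hypothetical (not dCache code): it hashes file names into enough subdirectories that the expected occupancy of any one directory stays well under the cap. The dataset path, file names, and the MAX_FILES_PER_DIR constant are assumptions for illustration.

  import hashlib

  MAX_FILES_PER_DIR = 10_000   # the per-directory guidance from the bullet above

  def subdir_for(filename: str, n_subdirs: int) -> str:
      """Pick a stable subdirectory for a file using a hash of its name."""
      h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
      return f"{h % n_subdirs:04d}"

  def layout(dataset: str, filenames: list[str]) -> dict[str, str]:
      """Map each file to /<dataset>/<hashed-subdir>/<file>."""
      n_subdirs = max(1, -(-len(filenames) // MAX_FILES_PER_DIR))   # ceiling division
      return {f: f"/{dataset}/{subdir_for(f, n_subdirs)}/{f}" for f in filenames}

  if __name__ == "__main__":
      files = [f"event_{i:07d}.root" for i in range(25_000)]
      print(layout("store/example2006", files)["event_0000000.root"])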

Page 4

FTS system

• Sits on top of SRM

• Its purpose is to throttle WAN IO according to site policies at LCG sites, including CERN (see the throttling sketch below).

• Bleeding edge with some issues

• Requires client tools on OSG sites
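The throttling idea can be illustrated with a short Python sketch. This is only a conceptual analogy, not FTS internals: the site names, concurrency limits, and the copy_fn callback are invented for the example; a production service layers queuing, retries, and per-VO shares on top of a basic limit like this.

  import threading

  # Invented policy table: maximum concurrent WAN transfers per destination site.
  SITE_POLICY = {"CERN": 20, "FNAL": 10, "UCSD": 5}

  _slots = {site: threading.Semaphore(limit) for site, limit in SITE_POLICY.items()}

  def transfer(source_url: str, dest_site: str, copy_fn):
      """Run copy_fn(source_url) only when the destination site has a free slot."""
      with _slots[dest_site]:          # blocks if the site is already at its limit
          return copy_fn(source_url)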

Page 5

SRM Status

• GIN data XFER interop in SRM v1.1
  – GGF initiative to test cross-grid interop

• Global agreement on SRM v2.2 (a conceptual sketch of these operations follows below):
  – Explicit space reservation
  – Namespace discovery and manipulation (ls, mkdir, …)
  – Storage classes based on latency
  – Backward compatibility NOT guaranteed for all implementations (dCache is backward compatible as of now)
  – Interop testing ongoing alongside development for the 3 implementations
  – New deployments across WLCG begin in 11/06
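As a conceptual outline only (not the actual SRM v2.2 WSDL or its method names), the Python sketch below shows the kinds of operations listed above: explicit space reservation, namespace discovery and manipulation, and storage classes distinguished by access latency.

  from dataclasses import dataclass
  from enum import Enum

  class AccessLatency(Enum):        # storage classes distinguished by latency
      ONLINE = "disk-like"
      NEARLINE = "tape-like"

  @dataclass
  class SpaceToken:                 # handle returned by an explicit reservation
      token: str
      size_bytes: int
      latency: AccessLatency

  class StorageResourceManager:
      def reserve_space(self, size_bytes: int, latency: AccessLatency) -> SpaceToken:
          """Explicit space reservation before writing."""
          raise NotImplementedError

      def ls(self, path: str) -> list[str]:
          """Namespace discovery."""
          raise NotImplementedError

      def mkdir(self, path: str) -> None:
          """Namespace manipulation."""
          raise NotImplementedError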

Page 6

SRM Implementation Status

• All 3 coming with SRM v2.2
• DRM (LBNL)
  – On top of a single filesystem
  – Managing multiple gsiftp servers
• L-Store (Vanderbilt)
  – Arbitrarily sized data object store, i.e. files are sliced into bitstreams, transparently to users
  – Implements DRM with IBP as the backend, with a single gsiftp server per SRM
• dCache (FNAL, DESY, et al.)
  – New monitoring for troubleshooting, used for debugging who, where, when, what (a roll-up sketch follows below)
  – Pluggable X.509 authz integrated with OSG authz
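To illustrate the who/where/when/what roll-up, a small hypothetical Python sketch follows; the record fields are assumptions for the example, not dCache's actual monitoring schema.

  from collections import Counter
  from dataclasses import dataclass
  from datetime import datetime

  @dataclass
  class TransferRecord:
      when: datetime
      who: str        # client identity (DN)
      where: str      # pool or door that served the transfer
      what: str       # file path
      ok: bool

  def failures_by_user_and_pool(records: list[TransferRecord]) -> Counter:
      """Count failed transfers per (who, where) pair to spot trouble quickly."""
      return Counter((r.who, r.where) for r in records if not r.ok)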

Page 7

Networking (1)

• UltraLight:
  – Helping to understand the network as a managed resource
  – Co-scheduling of network & storage
  – Recent focus on end-host performance tests, leading to 4.3 Gbps (≈540 MB/s) disk-to-disk between 2 systems @ CERN & Caltech
• Collaboration with REDDnet:
  – IBP depots across the US

• Clyde: Generic testing & validation framework

Page 8

Networking (2)

• LambdaStation
  – Dedicated virtual circuits between sites, dynamically provisioned by network providers in 2007 and beyond
  – Need to be able to selectively “switch” data flows from the campus network onto these circuits
  – Relationship to storage:
    • Storage, e.g. SRM/dCache, knows who, when, where, and how much to Xfer
    • Need that knowledge to switch flows onto circuits according to policy (see the decision sketch below)
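A minimal sketch of that policy decision, assuming made-up site pairs and a size threshold; the actual LambdaStation/SRM interplay is more involved.

  from dataclasses import dataclass

  # Illustrative assumptions: which site pairs have a provisionable circuit,
  # and how large a transfer must be before a circuit is worth requesting.
  CIRCUIT_SITES = {("FNAL", "Caltech")}
  MIN_BYTES_FOR_CIRCUIT = 500 * 10**9

  @dataclass
  class PendingTransfer:
      who: str              # VO or user requesting the transfer
      src_site: str
      dst_site: str
      size_bytes: int

  def use_dedicated_circuit(t: PendingTransfer) -> bool:
      """Should this flow be switched off the campus network onto a circuit?"""
      return ((t.src_site, t.dst_site) in CIRCUIT_SITES
              and t.size_bytes >= MIN_BYTES_FOR_CIRCUIT)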

Page 9

Predicting Scheduling Wait Time

• Two problems:
  – How long a wait time should I expect if I submit a workload to cluster X?
  – What are the chances of finishing my workload within time t?
• Predict answers to these questions based on historical data (a minimal sketch follows below).
• Installed on a number of clusters: TeraGrid, CDF@FNAL, NAREGI@Japan, …

• Doing this for batch systems and networks.
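A minimal sketch of the idea, assuming nothing more than a list of historical wait times for one cluster; real predictors use more careful statistics, but the empirical flavor is the same. (The sketch bounds the wait only; an end-to-end "finished within t" estimate would also need the job's runtime.)

  def expected_wait(history_seconds: list[float], quantile: float = 0.95) -> float:
      """An upper-quantile bound on the wait, taken from historical samples."""
      ordered = sorted(history_seconds)
      idx = min(len(ordered) - 1, int(quantile * len(ordered)))
      return ordered[idx]

  def prob_wait_within(history_seconds: list[float], t: float) -> float:
      """Fraction of historical waits that were at most t seconds."""
      return sum(1 for w in history_seconds if w <= t) / len(history_seconds)

  if __name__ == "__main__":
      waits = [30, 45, 60, 120, 300, 600, 900, 1800, 3600, 7200]
      print(expected_wait(waits), prob_wait_within(waits, 600.0))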

Page 10

Conclusion of Discussions

• Let’s focus on doing simple things first:
  – Provide access to a broader set of communities on the deployments we have
  – Understand what we can, and cannot, sell to the user communities

• Understand the management model for storage on the grid for a broad set of communities.

Page 11

Fkw’s Conclusion of Session

• Staging data in and out of compute sites is recognized as a major focus in OSG for the next year.

• The LHC community is operating on OSG at a scale orders of magnitude above anybody else’s.

• How can others benefit from this?

• Networking is preparing to solve the bottlenecks of the next order of magnitude.