CASTOR Status at RAL
CASTOR External Operations Face-to-Face Meeting
Bonny Strong, 10 June 2008
Topics
• Current Architecture
• Upgrades and Changes Completed
• Operational Challenges
• Tape Server Issues
• Diskserver Deployment Project
• CCRC08
• Certification Testbed
• Top 5 Issues
• Plans for Next 6 Months
Current Architecture – Production
cms stager:   3 head nodes, 45 diskservers (552 TB)
atlas stager: 3 head nodes, 64 diskservers (433 TB)
lhcb stager:  3 head nodes, 21 diskservers (133 TB)
gen stager:   3 head nodes, 1 diskserver (6 TB) (repack plus smaller users)
Current Architecture – Production Shared Services
• Nameservers: 2 servers for nsdaemon, in a DNS load-balanced cluster (illustrated below)
  – 1 of these also hosts: vdqm, vmgr, cupv
• Tape servers: 15 servers, FC-attached STK T10000 tape drives
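As a rough illustration of what the DNS load-balanced nameserver cluster buys here: clients resolve one alias and get an A record per nsdaemon host. The alias and port in this sketch are invented stand-ins, not the real RAL names:

```python
import random
import socket

def pick_nameserver(alias="castorns.example.ac.uk", port=5010):
    """Resolve a round-robin DNS alias and pick one A record.

    The alias and port are placeholders for illustration; a real
    DNS-load-balanced cluster returns one A record per nsdaemon host,
    so a failed host can simply be dropped from the alias.
    """
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    return random.choice(addresses)
```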
Upgrades and Changes Completed
• Nov 2007 – upgrade to 2.1.4 and SL4-64
  – Except tape servers
• Dec 2007 – SRMv2 in production
• Apr 2008 – upgrade to 2.1.6
• Oracle RAC (currently 3 nodes)
  – SRMv2
  – Gen instance stager and dlf
Upgrades and Changes Completed – continued
• Diskserver deployment project – prototype and testing
• Nagios monitoring for 24/7 support and callout procedures (see the sketch after this list)
• Dynamic Information Provider developed
• Migration from dCache
  – Completed for lhcb
  – Completed for atlas disk
  – Still working on atlas tape
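The deck doesn't show the Nagios checks themselves; as a minimal sketch, a callout check can be any script honouring the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The host and port below are placeholders, not real RAL values:

```python
#!/usr/bin/env python
"""Minimal Nagios-style check: is a CASTOR daemon accepting connections?

Only the exit-code convention (0=OK, 1=WARNING, 2=CRITICAL) is
Nagios-defined; the host and port below are illustrative placeholders.
"""
import socket
import sys

HOST = "stager01.example.ac.uk"   # placeholder, not a real RAL host
PORT = 9002                       # placeholder daemon port

def main():
    try:
        sock = socket.create_connection((HOST, PORT), timeout=5)
        sock.close()
    except OSError as exc:
        print("CRITICAL - cannot reach %s:%d (%s)" % (HOST, PORT, exc))
        sys.exit(2)
    print("OK - %s:%d accepting connections" % (HOST, PORT))
    sys.exit(0)

if __name__ == "__main__":
    main()
```

A check like this, wired to the bleeper callout, is the kind of linkage the after-hours system described later depends on.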
Operational Challenges
• Power failures
  – 8 Feb: 1½ days to get CASTOR back in service
  – 6 May: 1 full day to bring CASTOR back in service
  – Database filesystem corruption; trying to get UPS for DBs
  – LSF startup issues
• Establishing after-hours callout system
  – Linkages between Nagios and bleeper callout
  – Hindered by broken link between Tier1 helpdesk and GGUS
  – Tuning monitoring alerts
  – Developing operational documentation
  – Developing handoff procedures
• Backplane meltdown of Viglen diskservers (Feb)
• Failure of new diskservers (April)
• Loss of DBA due to budget cuts
Tape Server Issues
• Have not yet been able to install SLC4 64-bit
  – NI failure with 64-bit (now understood?)
  – Missing device /dev/nst1
    • Fibre-channel card?
    • Transtec vs IBM servers?
    • Something special in kernel at CERN?
  – Have been working on this tape server upgrade for over 6 months
  – Still running CASTOR 2.1.3 on SLC3
• Much work on improving migration performance
• Now seeing serious problems with servers hanging under heavy recall load
Disk server deployment today and tomorrow (1/2):
[Diagram: Kick-start script → Post-install script → CASTOR personalization → Disk server registration]

Disk server deployment today and tomorrow (2/2):
[Diagram: the same four stages (Kick-start script → Post-install script → CASTOR personalization → Disk server registration), with Puppet added to drive them]

A rough sketch of this pipeline follows.
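Purely to illustrate the four-stage pipeline in the diagrams above (every script name and path here is hypothetical, not from the real RAL setup), a driver could chain the stages like this:

```python
#!/usr/bin/env python
"""Sketch of the four-stage diskserver deployment pipeline.

All script paths are hypothetical; the real stages at RAL are
kickstart, post-install, CASTOR personalization, and registration,
with Puppet intended to manage the flow in the "tomorrow" picture.
"""
import subprocess
import sys

STAGES = [
    ["/usr/local/bin/run-kickstart.sh"],        # hypothetical path
    ["/usr/local/bin/post-install.sh"],         # hypothetical path
    ["/usr/local/bin/castor-personalize.sh"],   # hypothetical path
    ["/usr/local/bin/register-diskserver.sh"],  # hypothetical path
]

def deploy(hostname):
    """Run each stage in order, stopping at the first failure."""
    for stage in STAGES:
        ret = subprocess.call(stage + [hostname])
        if ret != 0:
            sys.exit("stage %s failed for %s (exit %d)"
                     % (stage[0], hostname, ret))

if __name__ == "__main__":
    deploy(sys.argv[1])
```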
CCRC08
• Big success of migration policies overcoming poor tape migration for CMS
• Working out system and procedures for out-of-hours callouts has required much time and effort
• Biggest problems:
  – Power outage on 6 May
  – Problems with new root certificate on 20 May
  – SRMv2 crashes
  – Major problems with tape drives hanging, starting last week
CCRC08 – continued
• Smaller problems:
  – Disk-disk copies staying in PEND, but setting DiskCopyPendTimeout made this workable
  – Jobs submitted to disk1 servers which became full; overcome by setting PendTimeout
  – Daily restart of castor-gridftp (to clean dead processes) causing failures of both SE and CE jobs
CCRC08 – Tape Transfer Stats
[Chart: tape transfer statistics, data from the last week of May]
Certification Testbed
• Have not delivered what we expected
• Have completed installation of infrastructure
• Extending contract for testbed sysadmin another 4 months, pending STFC approval
• Need to maintain release structure to be able to take advantage of work completed
Top 5 Issues
• Tape drives hanging
• GC bug
• Problems installing tape servers with SL4-64
• Repack
• Disk-disk copies stay in PEND or (earlier) multiple copies

Other issues:
• Slow database, requires new stats
• Support for work on migration policies
• Submitting jobs to disk1 servers which become full
• Bulk deletes
Plans for Next 6 Months
1. Diskserver deployment project
2. Certification testbed used to test new releases
3. Tape servers upgraded to SLC4-64
4. Disaster recovery plan
   • Based on Puppet, kickstart, and DB backups
   • Documented
   • Tested
5. Support for small experiments
6. Repack operational
7. Xrootd and rootd
Plans for Next 6 Months – cont.
8. Improved resilience:
   • Oracle RAC and Dataguard
   • Redundant, load-balanced stagers
   • Cold standby jobManager/LSF hosts
   • Failover of vdqm/vmgr/cupv
9. Decommission SRMv1 endpoints
10. Castor-gridftp v2 – internal
11. Continued improvements in monitoring and documentation
Plans for Next 6 Months – cont.
12. Planning for move to new computer building in Dec 08
13. Possible second robot
14. Tape media – T10000B
15. UPS for DB servers ASAP!
CIP – CASTOR Information Provider
Jens Jensen (in absentia)
CASTOR F2F meeting, 10-11 June 08
Anatomy of the CIP
[Diagram: CASTOR stagers → CIP back end → CIP front end → The Grid]
Front End
• Written by Derek Ross from Tier 1
• Publishes into Tier 1 BDIIs
• Uses a condensed text format for communicating with the back end (sketched below)
  – Historical reasons
  – One line per service class, condensing all the news that's fit to print
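The deck doesn't show the condensed format itself, so the following is a purely hypothetical illustration of what "one line per service class" could look like; the colon-separated field layout, names, and numbers are all invented:

```python
# Hypothetical one-line-per-service-class records. The field layout
# (name:total-TB:free-TB:files) is invented, not the CIP's real format.
SAMPLE = """\
svcA:100:40:12345
svcB:250:90:67890
"""

def parse_service_classes(text):
    """Parse one colon-separated record per line into dicts."""
    records = []
    for line in text.splitlines():
        name, total_tb, free_tb, files = line.split(":")
        records.append({
            "service_class": name,
            "total_tb": int(total_tb),
            "free_tb": int(free_tb),
            "files": int(files),
        })
    return records

print(parse_service_classes(SAMPLE))
```

A one-record-per-line text format like this keeps the front end trivially parseable, which fits the "historical reasons" note above.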
Back End
• Queries stager for information about disk pools
• Could have queried SRM DBs for space-token descriptions (STDs) etc.
  – But uses configuration file for non-dynamic information
  – Not all classes are published
  – Documented
Next steps
• Deploy backend with improved exception handling
  – Currently hangs if stager is down
  – Most recent version is ready for production but hasn't been deployed yet
• Could replace the front end and publish LDIF directly (see the sketch below)
• Packaging and improved deployment
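If the front end were bypassed, the back end would have to emit LDIF for the BDII itself. The sketch below renders one service class as a GLUE-1.3-flavoured storage-area entry; the DN layout and the particular attributes published are assumptions, not the CIP's actual output:

```python
def service_class_to_ldif(name, total_gb, free_gb):
    """Render one service class as a GLUE-1.3-style LDIF entry.

    The GlueSA* attributes follow the GLUE 1.3 storage-area schema,
    but exactly which attributes and DN layout the CIP would publish
    is an assumption here.
    """
    dn = ("dn: GlueSALocalID=%s,GlueSEUniqueID=srm.example.ac.uk,"
          "mds-vo-name=resource,o=grid" % name)
    lines = [
        dn,
        "objectClass: GlueSA",
        "GlueSALocalID: %s" % name,
        "GlueSATotalOnlineSize: %d" % total_gb,   # GB, per GLUE 1.3
        "GlueSAFreeOnlineSize: %d" % free_gb,     # GB, per GLUE 1.3
    ]
    return "\n".join(lines) + "\n"

print(service_class_to_ldif("exampleSvc", 1000, 400))
```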
See Also
• http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Accounting
  – Currently slightly out of date; refers to January version
  – Can deal with Bs and KBs and MBs and GBs and PBs and EBs (see the sketch below)
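A minimal sketch of that kind of unit handling, assuming decimal (powers-of-1000) units, which may or may not match what the accounting page actually uses:

```python
# Decimal multipliers; whether the accounting page uses powers of 1000
# or of 1024 is an assumption here.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9,
          "TB": 10**12, "PB": 10**15, "EB": 10**18}

def to_bytes(value):
    """Convert strings like '552 TB' or '6TB' to a byte count."""
    text = value.strip().upper()
    # Try longer suffixes first so 'TB' is not mistaken for 'B'.
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            number = text[: -len(unit)].strip()
            return int(float(number) * _UNITS[unit])
    raise ValueError("unrecognised size: %r" % value)

assert to_bytes("552 TB") == 552 * 10**12
assert to_bytes("6TB") == 6 * 10**12
```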