CASTOR Status at RAL
CASTOR External Operations Face-to-Face Meeting
Bonny Strong, 10 June 2008
Topics
• Current Architecture
• Upgrades and Changes Completed
• Operational Challenges
• Tape Server Issues
• Diskserver Deployment Project
• CCRC08
• Certification Testbed
• Top 5 Issues
• Plans for Next 6 Months
Current Architecture – Production
cms stager:   3 head nodes, 45 diskservers (552 TB)
atlas stager: 3 head nodes, 64 diskservers (433 TB)
lhcb stager:  3 head nodes, 21 diskservers (133 TB)
gen stager:   3 head nodes, 1 diskserver (6 TB) (repack plus smaller users)
Current Architecture – Production Shared Services
• Nameservers: 2 servers for nsdaemon, in a DNS load-balanced cluster (illustrated below)
  – 1 of these also hosts: vdqm, vmgr, cupv
• Tape servers: 15 servers, FC-attached STK T10000 tape drives
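As a rough illustration of what the DNS load-balanced nameserver cluster buys here: clients resolve one alias and get an A record per nsdaemon host. The alias and port in this sketch are invented stand-ins, not the real RAL names:

```python
import random
import socket

def pick_nameserver(alias="castorns.example.ac.uk", port=5010):
    """Resolve a round-robin DNS alias and pick one A record.

    The alias and port are placeholders for illustration; a real
    DNS-load-balanced cluster returns one A record per nsdaemon host,
    so a failed host can simply be dropped from the alias.
    """
    infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    return random.choice(addresses)
```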
Upgrades and Changes Completed
• Nov 2007 – upgrade to 2.1.4 and SL4-64
  – Except tape servers
• Dec 2007 – SRMv2 in production
• Apr 2008 – upgrade to 2.1.6
• Oracle RAC (currently 3 nodes)
  – SRMv2
  – Gen instance stager and dlf
Upgrades and Changes Completed – continued
• Diskserver deployment project – prototype and testing
• Nagios monitoring for 24/7 support and callout procedures (see the sketch after this list)
• Dynamic Information Provider developed
• Migration from dCache
  – Completed for lhcb
  – Completed for atlas disk
  – Still working on atlas tape
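The deck doesn't show the Nagios checks themselves; as a minimal sketch, a callout check can be any script honouring the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The host and port below are placeholders, not real RAL values:

```python
#!/usr/bin/env python
"""Minimal Nagios-style check: is a CASTOR daemon accepting connections?

Only the exit-code convention (0=OK, 1=WARNING, 2=CRITICAL) is
Nagios-defined; the host and port below are illustrative placeholders.
"""
import socket
import sys

HOST = "stager01.example.ac.uk"   # placeholder, not a real RAL host
PORT = 9002                       # placeholder daemon port

def main():
    try:
        sock = socket.create_connection((HOST, PORT), timeout=5)
        sock.close()
    except OSError as exc:
        print("CRITICAL - cannot reach %s:%d (%s)" % (HOST, PORT, exc))
        sys.exit(2)
    print("OK - %s:%d accepting connections" % (HOST, PORT))
    sys.exit(0)

if __name__ == "__main__":
    main()
```

A check like this, wired to the bleeper callout, is the kind of linkage the after-hours system described later depends on.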
Operational Challenges
• Power failures
  – 8 Feb: 1½ days to get CASTOR back in service
  – 6 May: 1 full day to bring CASTOR back in service
  – Database filesystem corruption; trying to get UPS for DBs
  – LSF startup issues
• Establishing after-hours callout system
  – Linkages between Nagios and bleeper callout
  – Hindered by broken link between Tier1 helpdesk and GGUS
  – Tuning monitoring alerts
  – Developing operational documentation
  – Developing handoff procedures
• Backplane meltdown of Viglen diskservers (Feb)
• Failure of new diskservers (April)
• Loss of DBA due to budget cuts
Tape Server Issues
• Have not yet been able to install SLC4 64-bit
  – NI failure with 64-bit (now understood?)
  – Missing device /dev/nst1
    • Fibre-channel card?
    • Transtec vs IBM servers?
    • Something special in kernel at CERN?
  – Have been working on this tape server upgrade for over 6 months
  – Still running CASTOR 2.1.3 on SLC3
• Much work on improving migration performance
• Now seeing serious problems with servers hanging under heavy recall load
Disk server deployment today and tomorrow (1/2):
[Diagram: Kick-start script → Post-install script → CASTOR personalization → Disk server registration]

Disk server deployment today and tomorrow (2/2):
[Diagram: the same four stages (Kick-start script → Post-install script → CASTOR personalization → Disk server registration), with Puppet added to drive them]

A rough sketch of this pipeline follows.
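Purely to illustrate the four-stage pipeline in the diagrams above (every script name and path here is hypothetical, not from the real RAL setup), a driver could chain the stages like this:

```python
#!/usr/bin/env python
"""Sketch of the four-stage diskserver deployment pipeline.

All script paths are hypothetical; the real stages at RAL are
kickstart, post-install, CASTOR personalization, and registration,
with Puppet intended to manage the flow in the "tomorrow" picture.
"""
import subprocess
import sys

STAGES = [
    ["/usr/local/bin/run-kickstart.sh"],        # hypothetical path
    ["/usr/local/bin/post-install.sh"],         # hypothetical path
    ["/usr/local/bin/castor-personalize.sh"],   # hypothetical path
    ["/usr/local/bin/register-diskserver.sh"],  # hypothetical path
]

def deploy(hostname):
    """Run each stage in order, stopping at the first failure."""
    for stage in STAGES:
        ret = subprocess.call(stage + [hostname])
        if ret != 0:
            sys.exit("stage %s failed for %s (exit %d)"
                     % (stage[0], hostname, ret))

if __name__ == "__main__":
    deploy(sys.argv[1])
```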
CCRC08
• Big success of migration policies overcoming poor tape migration for CMS
• Working out system and procedures for out-of-hours callouts has required much time and effort
• Biggest problems:
  – Power outage on 6 May
  – Problems with new root certificate on 20 May
  – SRMv2 crashes
  – Major problems with tape drives hanging, starting last week
CCRC08 – continued
• Smaller problems:
  – Disk-disk copies staying in PEND, but setting DiskCopyPendTimeout made this workable
  – Jobs submitted to disk1 servers which became full; overcome by setting PendTimeout
  – Daily restart of castor-gridftp (to clean dead processes) causing failures of both SE and CE jobs
CCRC08 – Tape Transfer Stats
[Chart: tape transfer statistics, data from the last week of May]
Certification Testbed
• Have not delivered what we expected
• Have completed installation of infrastructure
• Extending contract for testbed sysadmin another 4 months, pending STFC approval
• Need to maintain release structure to be able to take advantage of work completed
Top 5 Issues
• Tape drives hanging
• GC bug
• Problems installing tape servers with SL4-64
• Repack
• Disk-disk copies stay in PEND or (earlier) multiple copies

Other issues:
• Slow database, requires new stats
• Support for work on migration policies
• Submitting jobs to disk1 servers which become full
• Bulk deletes
Plans for Next 6 Months
1. Diskserver deployment project
2. Certification testbed used to test new releases
3. Tape servers upgraded to SLC4-64
4. Disaster recovery plan
   • Based on Puppet, kickstart, and DB backups
   • Documented
   • Tested
5. Support for small experiments
6. Repack operational
7. Xrootd and rootd
Plans for Next 6 Months – cont.
8. Improved resilience:
   • Oracle RAC and Dataguard
   • Redundant, load-balanced stagers
   • Cold standby jobManager/LSF hosts
   • Failover of vdqm/vmgr/cupv
9. Decommission SRMv1 endpoints
10. Castor-gridftp v2 – internal
11. Continued improvements in monitoring and documentation
Plans for Next 6 Months – cont.
12. Planning for move to new computer building in Dec 08
13. Possible second robot
14. Tape media – T10000B
15. UPS for DB servers ASAP!
CIP – CASTOR Information Provider
Jens Jensen (in absentia)
CASTOR F2F meeting, 10-11 June 08
Anatomy of the CIP
[Diagram: CASTOR stagers → CIP back end → CIP front end → The Grid]
Front End
• Written by Derek Ross from Tier 1
• Publishes into Tier 1 BDIIs
• Uses a condensed text format for communicating with the back end (sketched below)
  – Historical reasons
  – One line per service class, condensing all the news that's fit to print
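The deck doesn't show the condensed format itself, so the following is a purely hypothetical illustration of what "one line per service class" could look like; the colon-separated field layout, names, and numbers are all invented:

```python
# Hypothetical one-line-per-service-class records. The field layout
# (name:total-TB:free-TB:files) is invented, not the CIP's real format.
SAMPLE = """\
svcA:100:40:12345
svcB:250:90:67890
"""

def parse_service_classes(text):
    """Parse one colon-separated record per line into dicts."""
    records = []
    for line in text.splitlines():
        name, total_tb, free_tb, files = line.split(":")
        records.append({
            "service_class": name,
            "total_tb": int(total_tb),
            "free_tb": int(free_tb),
            "files": int(files),
        })
    return records

print(parse_service_classes(SAMPLE))
```

A one-record-per-line text format like this keeps the front end trivially parseable, which fits the "historical reasons" note above.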
Back End
• Queries stager for information about disk pools
• Could have queried SRM DBs for space-token descriptions (STDs) etc.
  – But uses configuration file for non-dynamic information
  – Not all classes are published
  – Documented
Next steps
• Deploy backend with improved exception handling
  – Currently hangs if stager is down
  – Most recent version is ready for production but hasn't been deployed yet
• Could replace the front end and publish LDIF directly (see the sketch below)
• Packaging and improved deployment
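If the front end were bypassed, the back end would have to emit LDIF for the BDII itself. The sketch below renders one service class as a GLUE-1.3-flavoured storage-area entry; the DN layout and the particular attributes published are assumptions, not the CIP's actual output:

```python
def service_class_to_ldif(name, total_gb, free_gb):
    """Render one service class as a GLUE-1.3-style LDIF entry.

    The GlueSA* attributes follow the GLUE 1.3 storage-area schema,
    but exactly which attributes and DN layout the CIP would publish
    is an assumption here.
    """
    dn = ("dn: GlueSALocalID=%s,GlueSEUniqueID=srm.example.ac.uk,"
          "mds-vo-name=resource,o=grid" % name)
    lines = [
        dn,
        "objectClass: GlueSA",
        "GlueSALocalID: %s" % name,
        "GlueSATotalOnlineSize: %d" % total_gb,   # GB, per GLUE 1.3
        "GlueSAFreeOnlineSize: %d" % free_gb,     # GB, per GLUE 1.3
    ]
    return "\n".join(lines) + "\n"

print(service_class_to_ldif("exampleSvc", 1000, 400))
```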
See Also
• http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Accounting
  – Currently slightly out of date; refers to January version
  – Can deal with Bs and KBs and MBs and GBs and PBs and EBs (see the sketch below)
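A minimal sketch of that kind of unit handling, assuming decimal (powers-of-1000) units, which may or may not match what the accounting page actually uses:

```python
# Decimal multipliers; whether the accounting page uses powers of 1000
# or of 1024 is an assumption here.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9,
          "TB": 10**12, "PB": 10**15, "EB": 10**18}

def to_bytes(value):
    """Convert strings like '552 TB' or '6TB' to a byte count."""
    text = value.strip().upper()
    # Try longer suffixes first so 'TB' is not mistaken for 'B'.
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            number = text[: -len(unit)].strip()
            return int(float(number) * _UNITS[unit])
    raise ValueError("unrecognised size: %r" % value)

assert to_bytes("552 TB") == 552 * 10**12
assert to_bytes("6TB") == 6 * 10**12
```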