RAL Site Report
Martin Bly
HEPiX @ SLAC – 11-13 October 2005
Overview
• Intro
• Hardware
• OS/Software
• Services
• Issues
RAL T1
• Rutherford Appleton Lab hosts the UK LCG Tier-1
  – Funded via the GridPP project from PPARC
  – Supports LCG and UK Particle Physics users
• VOs:
  – LCG: Atlas, CMS, LHCb, (Alice), dteam
  – Babar
  – CDF, D0, H1, Zeus
  – Bio, Pheno
• Expts:
  – Minos, Mice, SNO, UKQCD
• Theory users
• …
Tier 1 Hardware
• ~950 CPUs in batch service
  – 1.4GHz, 2.66GHz, 2.8GHz – P3 and P4/Xeon (HT off)
  – 1.0GHz systems retiring as they fail; phase-out end Oct '05
  – New procurement:
    • Aiming for 1400+ SPECint2000/CPU
    • Systems for testing as part of evaluation of tender
    • First delivery early '06, second delivery in April/May '06
• ~40 systems for services (FEs, RB, CE, LCG servers, loggers etc)
• 60+ disk servers
  – Mostly SCSI-attached IDE or SATA, ~220TB unformatted
  – New procurement: probably a PCI/SATA solution
• Tape robot
  – 6K slots, 1.2PB, 10 drives
Tape Robot / Data Store
• Current data: 300TB, PP -> 200+TB (110TB Babar)
• Castor 1 system trials
  – Many CERN-specifics
• HSM (Hierarchical Storage Manager)
  – 500TB, DMF (Data Management Facility)
    • SCSI/FC
    • Real file system
    • Data migrates to tape after inactivity
    • Not for PP data
  – Due November '05
• Procurement for a new robot underway
  – 3PB, ~10 tape drives
  – Expect to order end Oct '05
  – Delivery December '05
  – In service by March '06 (for SC4)
  – Castor system
Networking
• Tier-1 backbone at 4x1Gb/s
  – Upgrading some links to 10Gb/s
  – Multi-port 10Gb/s layer-2 switch stack as hub when available
• 1Gb/s production link Tier-1 to RAL site
• 1Gb/s link to SJ4 (internet)
  – 1Gb/s HW firewall
• Upgrade of site backbone to 10Gb/s expected late '05 / early '06
  – Link Tier-1 to site at 10Gb/s – possible mid-2006
  – Link site to SJ5 @ 10Gb/s – mid '06
• Site firewall remains an issue – 4Gb/s limit
• 2x1Gb/s link to UKLight
  – Separate development network in the UK
  – Links to CERN @ 2Gb/s, Lancaster @ 1Gb/s (pending)
  – Managed ~90MB/s during SC2, less since
    • Problems with small packet loss causing traffic limitations
  – Tier-1 to UKLight upgrade to 4x1Gb/s pending, 10Gb/s possible
  – UKLight link to CERN requested @ 4Gb/s for early '06
  – Over-running hardware upgrade (4 days expanded to 7 weeks)
Tier1 Network Core – SC3
[Diagram: Tier-1 network core during SC3 – 7i-1 and 7i-3 switches with the 5510-1/5510-2 stacks connect the dCache pools, gridftp servers, ADS caches and non-SC hosts to Router A and the UKLight router; 4x1Gb/s core links, 2x1Gb/s to CERN and 290Mb/s to Lancaster via UKLight, 1Gb/s through the firewall to SJ4 and the RAL site.]
OS/Software
• Main services:
  – Batch, FEs, CE, RB… : SL3 (3.0.3, 3.0.4, 3.0.5)
    • LCG 2_6_0
    • Torque/MAUI
    • 1 job/CPU
  – Disk: RH72 custom, RH73 custom
  – Some internal services on SL4 (loggers)
  – Project to use SL4.n for disk servers underway
• Solaris disk servers decommissioned
  – Most hardware sold
• AFS on AIX
  – Transarc
  – Project to move to Linux (SL3/4)
Services (1) – Objyserv
• Objyserv database service (Babar)
  – Old service on a traditional NFS server
    • Custom NFS, heavily loaded, unable to cope with increased activity on the batch farm due to threading issues in the server
    • An additional server with the same technology was not a tenable solution
  – New service:
    • Twin ams-based servers, 2 CPUs, HT on, 2GB RAM
    • SL3, RAID1 data disks
    • 4 servers per host system
      – Internal redirection using iptables to different server ports depending on which of 4 IP addresses is used to make the connection (see the sketch below)
    • Able to cope with some ease: 600+ clients
• Contact: Chris Brew
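The slides do not give the actual rules, but the idea of steering connections to a different local ams port depending on which service IP alias the client used can be sketched with iptables REDIRECT rules. A minimal illustration, with hypothetical addresses and ports rather than the real configuration, driven from Python:

#!/usr/bin/env python
"""Illustrative sketch only: redirect traffic arriving on each of four IP
aliases of one host to a distinct local ams server port via iptables.
The aliases, ports and client-facing port below are hypothetical."""
import subprocess

# hypothetical mapping: service IP alias -> local port of the ams instance
REDIRECTS = {
    "192.168.10.1": 20001,
    "192.168.10.2": 20002,
    "192.168.10.3": 20003,
    "192.168.10.4": 20004,
}
CLIENT_PORT = 2345  # hypothetical port clients connect to on every alias

def install_rules():
    for alias, local_port in REDIRECTS.items():
        # REDIRECT rewrites the destination port of TCP packets addressed to
        # this alias, so four server instances can share one physical host
        # while clients always use the same well-known port.
        subprocess.run(
            ["iptables", "-t", "nat", "-A", "PREROUTING",
             "-d", alias, "-p", "tcp", "--dport", str(CLIENT_PORT),
             "-j", "REDIRECT", "--to-ports", str(local_port)],
            check=True,
        )

if __name__ == "__main__":
    install_rules()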
Services (2) – Home File System
• Home file system migration
  – Old system:
    • ~85GB on an A1000 RAID array
    • Sun Ultra10, Solaris 2.6, 100Mb/s NIC
    • Failed to cope with some forms of pathological use
  – New system:
    • ~270GB SCSI RAID5, 6-disk chassis
    • 2.4GHz Xeon, 1GB RAM, 1Gb/s NIC
    • SL3, ext3
    • Stable under I/O and quota testing, and during backup
  – Migration:
    • 3 weeks of planning
    • 1 week of nightly rsync followed by checksumming, to convince ourselves the rsync works (see the sketch after this list)
    • 1-day farm shutdown to migrate
    • A single file detected with a checksum error
  – Quotas for users unchanged…
  – Old system kept on standby to restore its backups
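The kind of post-rsync verification pass described above can be sketched as a tree walk comparing per-file checksums; the paths and the choice of MD5 here are illustrative assumptions, not the actual scripts used:

#!/usr/bin/env python
"""Illustrative sketch: after an rsync pass, walk the old and new home areas
and compare per-file MD5 checksums. Mount points are hypothetical."""
import hashlib
import os
import sys

OLD_ROOT = "/old/home"   # hypothetical mount of the old Solaris/A1000 filesystem
NEW_ROOT = "/new/home"   # hypothetical mount of the new SL3/ext3 filesystem

def md5sum(path, blocksize=1 << 20):
    # Read in blocks so large home files do not need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def verify():
    bad = 0
    for dirpath, _dirnames, filenames in os.walk(OLD_ROOT):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(NEW_ROOT, os.path.relpath(old_path, OLD_ROOT))
            if not os.path.isfile(new_path):
                print("MISSING", new_path)
                bad += 1
            elif md5sum(old_path) != md5sum(new_path):
                print("CHECKSUM MISMATCH", new_path)
                bad += 1
    return bad

if __name__ == "__main__":
    sys.exit(1 if verify() else 0)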
Services (3) – Batch Server
• Catastrophic disk failure late on a Saturday evening over a holiday weekend
  – Staff not expected back until 8:30am Wednesday
• Problem noted Tuesday morning
  – Initial inspection: the disk was a total failure
  – No easy access to backups
    • Backup tape numbers were in logs on the failed disk!
  – No easy recovery solution with no other system staff available
  – Jobs appeared happy – terminating OK, sending sandboxes to the gatekeeper etc. – but no accounting data and no new jobs started
• Wednesday:
  – Hardware 'revised' with two disks, software RAID1, clean install of SL3
  – Backups located, batch/scheduling configs recovered from the tape store
  – System restarted with MAUI off to allow Torque to sort itself out
    • Queues came up closed
  – MAUI restarted
  – Service picked up smoothly
• Lessons:
  – Know where the backups are and how to identify which tapes are the right ones
  – Unmodified batch workers are not good enough for system services
Issues
• How to run resilient services on non-resilient hardware?
  – Committed to run 24x365 with 98%+ uptime
  – Modified batch workers with extra disks and HS caddies as servers
  – Investigating HA-Linux
    • Batch server and scheduling experiments positive
    • RB, CE, BDII, R-GMA …
  – Databases
• Building services maintenance
  – Aircon, power
    • Already two substantial shutdowns in 2006
    • New building
• UKLight is a development project network
  – There have been problems with managing expectations for production services on a development network
• Unresolved packet loss in CERN-RAL transfers
  – Under investigation
• 10Gb/s kit expensive
  – Components we would like are not yet affordable/available
  – Pushing against LCG turn-on date
Questions?