RAL Site Report
Martin Bly
HEPiX @ SLAC – 11-13 October 2005
Overview
• Intro
• Hardware
• OS/Software
• Services
• Issues
RAL T1
• Rutherford Appleton Lab hosts the UK LCG Tier-1
  – Funded via the GridPP project from PPARC
  – Supports LCG and UK Particle Physics users
• VOs:
  – LCG: Atlas, CMS, LHCb, (Alice), dteam
  – Babar
  – CDF, D0, H1, Zeus
  – Bio, Pheno
• Expts:
  – Minos, Mice, SNO, UKQCD
• Theory users
• …
Tier 1 Hardware
• ~950 CPUs in batch service
  – 1.4GHz, 2.66GHz, 2.8GHz – P3 and P4/Xeon (HT off)
  – 1.0GHz systems retiring as they fail; phase-out end Oct '05
  – New procurement:
    • Aiming for 1400+ SPECint2000/CPU
    • Systems for testing as part of evaluation of tender
    • First delivery early '06, second delivery in April/May '06
• ~40 systems for services (FEs, RB, CE, LCG servers, loggers etc)
• 60+ disk servers
  – Mostly SCSI-attached IDE or SATA, ~220TB unformatted
  – New procurement: probably a PCI/SATA solution
• Tape robot
  – 6K slots, 1.2PB, 10 drives
Tape Robot / Data Store
• Current data: 300TB, PP -> 200+TB (110TB Babar)
• Castor 1 system trials
  – Many CERN-specifics
• HSM (Hierarchical Storage Manager)
  – 500TB, DMF (Data Management Facility)
    • SCSI/FC
    • Real file system
    • Data migrates to tape after inactivity
    • Not for PP data
  – Due November '05
• Procurement for a new robot underway
  – 3PB, ~10 tape drives
  – Expect to order end Oct '05
  – Delivery December '05
  – In service by March '06 (for SC4)
  – Castor system
Networking
• Tier-1 backbone at 4x1Gb/s
  – Upgrading some links to 10Gb/s
  – Multi-port 10Gb/s layer-2 switch stack as hub when available
• 1Gb/s production link Tier-1 to RAL site
• 1Gb/s link to SJ4 (internet)
  – 1Gb/s HW firewall
• Upgrade of site backbone to 10Gb/s expected late '05 / early '06
  – Link Tier-1 to site at 10Gb/s – possible mid-2006
  – Link site to SJ5 @ 10Gb/s – mid '06
• Site firewall remains an issue – 4Gb/s limit
• 2x1Gb/s link to UKLight
  – Separate development network in the UK
  – Links to CERN @ 2Gb/s, Lancaster @ 1Gb/s (pending)
  – Managed ~90MB/s during SC2, less since
    • Problems with small packet loss causing traffic limitations
  – Tier-1 to UKLight upgrade to 4x1Gb/s pending, 10Gb/s possible
  – UKLight link to CERN requested @ 4Gb/s for early '06
  – Over-running hardware upgrade (4 days expanded to 7 weeks)
Tier1 Network Core – SC3
[Diagram: Tier-1 network core during SC3 – 7i-1 and 7i-3 switches with the 5510-1/5510-2 stacks connect the dCache pools, gridftp servers, ADS caches and non-SC hosts to Router A and the UKLight router; 4x1Gb/s core links, 2x1Gb/s to CERN and 290Mb/s to Lancaster via UKLight, 1Gb/s through the firewall to SJ4 and the RAL site.]
OS/Software
• Main services:
  – Batch, FEs, CE, RB… : SL3 (3.0.3, 3.0.4, 3.0.5)
    • LCG 2_6_0
    • Torque/MAUI
    • 1 job/CPU
  – Disk: RH72 custom, RH73 custom
  – Some internal services on SL4 (loggers)
  – Project to use SL4.n for disk servers underway
• Solaris disk servers decommissioned
  – Most hardware sold
• AFS on AIX
  – Transarc
  – Project to move to Linux (SL3/4)
Services (1) – Objyserv
• Objyserv database service (Babar)
  – Old service on a traditional NFS server
    • Custom NFS, heavily loaded, unable to cope with increased activity on the batch farm due to threading issues in the server
    • An additional server with the same technology was not a tenable solution
  – New service:
    • Twin ams-based servers, 2 CPUs, HT on, 2GB RAM
    • SL3, RAID1 data disks
    • 4 servers per host system
      – Internal redirection using iptables to different server ports depending on which of 4 IP addresses is used to make the connection (see the sketch below)
    • Able to cope with some ease: 600+ clients
• Contact: Chris Brew
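The slides do not give the actual rules, but the idea of steering connections to a different local ams port depending on which service IP alias the client used can be sketched with iptables REDIRECT rules. A minimal illustration, with hypothetical addresses and ports rather than the real configuration, driven from Python:

#!/usr/bin/env python
"""Illustrative sketch only: redirect traffic arriving on each of four IP
aliases of one host to a distinct local ams server port via iptables.
The aliases, ports and client-facing port below are hypothetical."""
import subprocess

# hypothetical mapping: service IP alias -> local port of the ams instance
REDIRECTS = {
    "192.168.10.1": 20001,
    "192.168.10.2": 20002,
    "192.168.10.3": 20003,
    "192.168.10.4": 20004,
}
CLIENT_PORT = 2345  # hypothetical port clients connect to on every alias

def install_rules():
    for alias, local_port in REDIRECTS.items():
        # REDIRECT rewrites the destination port of TCP packets addressed to
        # this alias, so four server instances can share one physical host
        # while clients always use the same well-known port.
        subprocess.run(
            ["iptables", "-t", "nat", "-A", "PREROUTING",
             "-d", alias, "-p", "tcp", "--dport", str(CLIENT_PORT),
             "-j", "REDIRECT", "--to-ports", str(local_port)],
            check=True,
        )

if __name__ == "__main__":
    install_rules()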
Services (2) – Home File System
• Home file system migration
  – Old system:
    • ~85GB on an A1000 RAID array
    • Sun Ultra10, Solaris 2.6, 100Mb/s NIC
    • Failed to cope with some forms of pathological use
  – New system:
    • ~270GB SCSI RAID5, 6-disk chassis
    • 2.4GHz Xeon, 1GB RAM, 1Gb/s NIC
    • SL3, ext3
    • Stable under I/O and quota testing, and during backup
  – Migration:
    • 3 weeks of planning
    • 1 week of nightly rsync followed by checksumming, to convince ourselves the rsync works (see the sketch after this list)
    • 1-day farm shutdown to migrate
    • A single file detected with a checksum error
  – Quotas for users unchanged…
  – Old system kept on standby to restore its backups
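The kind of post-rsync verification pass described above can be sketched as a tree walk comparing per-file checksums; the paths and the choice of MD5 here are illustrative assumptions, not the actual scripts used:

#!/usr/bin/env python
"""Illustrative sketch: after an rsync pass, walk the old and new home areas
and compare per-file MD5 checksums. Mount points are hypothetical."""
import hashlib
import os
import sys

OLD_ROOT = "/old/home"   # hypothetical mount of the old Solaris/A1000 filesystem
NEW_ROOT = "/new/home"   # hypothetical mount of the new SL3/ext3 filesystem

def md5sum(path, blocksize=1 << 20):
    # Read in blocks so large home files do not need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def verify():
    bad = 0
    for dirpath, _dirnames, filenames in os.walk(OLD_ROOT):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(NEW_ROOT, os.path.relpath(old_path, OLD_ROOT))
            if not os.path.isfile(new_path):
                print("MISSING", new_path)
                bad += 1
            elif md5sum(old_path) != md5sum(new_path):
                print("CHECKSUM MISMATCH", new_path)
                bad += 1
    return bad

if __name__ == "__main__":
    sys.exit(1 if verify() else 0)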
Services (3) – Batch Server
• Catastrophic disk failure late on a Saturday evening over a holiday weekend
  – Staff not expected back until 8:30am Wednesday
• Problem noted Tuesday morning
  – Initial inspection: the disk was a total failure
  – No easy access to backups
    • Backup tape numbers were in logs on the failed disk!
  – No easy recovery solution with no other system staff available
  – Jobs appeared happy – terminating OK, sending sandboxes to the gatekeeper etc. – but no accounting data and no new jobs started
• Wednesday:
  – Hardware 'revised' with two disks, software RAID1, clean install of SL3
  – Backups located, batch/scheduling configs recovered from the tape store
  – System restarted with MAUI off to allow Torque to sort itself out
    • Queues came up closed
  – MAUI restarted
  – Service picked up smoothly
• Lessons:
  – Know where the backups are and how to identify which tapes are the right ones
  – Unmodified batch workers are not good enough for system services
Issues
• How to run resilient services on non-resilient hardware?
  – Committed to run 24x365 with 98%+ uptime
  – Modified batch workers with extra disks and HS caddies as servers
  – Investigating HA-Linux
    • Batch server and scheduling experiments positive
    • RB, CE, BDII, R-GMA …
  – Databases
• Building services maintenance
  – Aircon, power
    • Already two substantial shutdowns in 2006
    • New building
• UKLight is a development project network
  – There have been problems with managing expectations for production services on a development network
• Unresolved packet loss in CERN-RAL transfers
  – Under investigation
• 10Gb/s kit expensive
  – Components we would like are not yet affordable/available
  – Pushing against LCG turn-on date
Questions?