
Tier1 Report

HEPSysMan @ Cambridge, 23rd October 2006

Martin Bly


Overview

• Tier-1
• Hardware changes
• Services


RAL Tier-1

• RAL hosts the UK WLCG Tier-1
  – Funded via GridPP2 project from PPARC
  – Supports WLCG and UK Particle Physics users and collaborators
• VOs:
  – LHC: Atlas, CMS, LHCb, Alice, (dteam, ops)
  – Babar, CDF, D0, H1, Zeus
  – bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, …
• Other experiments:
  – Mice, SNO, UKQCD
• Theory users
• …


Staff / Finance

• Bid to PPARC for ‘GridPP3’ project
  – For exploitation phase of LHC
  – September 2007 to March 2011
  – Increase in staff and hardware resources
  – Result early 2007
• Tier-1 is recruiting
  – 2 x systems admins, 1 x hardware technician
  – 1 x grid deployment
  – Replacement for Steve Traylen to head grid deployment and user support group
• CCLRC internal reorganisation
  – Business Units
    • Tier1 service is run by E-Science department, which is now part of the Facilities Business Unit (FBU)


New building

• Funding approved for a new computer centre building
  – 3 floors
    • Computer rooms on ground floor, offices above
  – 240m² low power density room
    • Tape robots, disk servers etc
    • Minimum heat density 1.0 kW/m², rising to 1.6 kW/m² by 2012
  – 490m² high power density room
    • Servers, CPU farms, HPC clusters
      » Minimum heat density 1.8 kW/m², rising to 2.8 kW/m² by 2012
  – UPS computer room
    • 8 racks + 3 telecoms racks
    • UPS system to provide continuous power of 400A/92kVA three phase for equipment plus power to air conditioning (total approx 800A/184kVA; see the arithmetic check below)
  – Overall
    • Space for 300 racks (+ robots, telecoms)
    • Power: 2700kVA initially, max 5000kVA by 2012 (inc air-con)
    • UPS capacity to meet estimated 1000A/250kVA for 15-20 minutes for specific hardware for clean shutdown / surviving short breaks
  – Shared with HPC and other CCLRC computing facilities
  – Planned to be ready by summer 2008
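Note: the amp/kVA pairings quoted above are consistent with apparent power at a nominal 230V per phase; that voltage is an assumption for this check, not a figure from the slide.

    # Check the UPS figures above against apparent power = current x voltage,
    # assuming a nominal 230V per phase (assumed value, not given on the slide).
    NOMINAL_VOLTS = 230

    for amps, quoted_kva in [(400, 92), (800, 184), (1000, 250)]:
        kva = amps * NOMINAL_VOLTS / 1000.0
        print(f"{amps}A -> {kva:.0f}kVA (slide quotes {quoted_kva}kVA)")
    # 400A -> 92kVA and 800A -> 184kVA match exactly; 1000A -> 230kVA,
    # against the slide's rounded estimate of 250kVA.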


Hardware changes

• FY05/06 capacity procurement March 06
  – 52 x 1U twin dual-core AMD 270 units
    • Tyan 2882 motherboard
    • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
    • 208 job slots, 200kSI2K
    • Commissioned May 06, running well
  – 21 x 5U 24-bay disk servers
    • 168TB (210TB) data capacity (see the capacity check below)
    • Areca 1170 PCI-X 24-port controller
    • 22 x 400GB (500GB) SATA data drives, RAID 6
    • 2 x 250GB SATA system drives, RAID 1
    • 4GB RAM, dual 1Gb NIC
    • Commissioning delayed (more…)
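The data capacities quoted for these servers follow from the RAID 6 layout, which gives up two drives' worth of space to parity in each 22-drive array. A minimal check, using decimal TB as on the slide:

    # Usable capacity of the March 06 disk servers: 21 servers, each with a
    # 22-drive RAID 6 data array (two drives' worth of parity per array).
    servers, data_drives, parity_drives = 21, 22, 2

    for drive_tb in (0.4, 0.5):      # 400GB as delivered, 500GB after the swap
        usable = servers * (data_drives - parity_drives) * drive_tb
        print(f"{int(drive_tb * 1000)}GB drives: {usable:.0f}TB usable")
    # -> 168TB with 400GB drives, 210TB with 500GB drives, as quoted above.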


Hardware changes (2)

• FY 06/07 capacity procurements
  – 47 x 3U 16-bay disk servers: 282TB data capacity
    • 3Ware 9550SX-16ML PCI-X 16-port SATA RAID controller
    • 14 x 500GB SATA data drives, RAID 5
    • 2 x 250GB SATA system drives, RAID 1
    • Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC
    • Delivery expected October 06
  – 64 x 1U twin dual-core Intel Woodcrest 5130 units (550kSI2K)
    • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC
    • Delivery expected November 06
• Upcoming in FY 06/07:
  – Further 210TB disk capacity expected December 06
    • Same spec as above
  – High Availability systems with UPS
    • Redundant PSUs, hot-swap paired HDDs etc
  – AFS replacement
  – Enhancement to Oracle services (disk arrays or RAC servers)


Hardware changes (3)

• SL8500 tape robot
  – Expanded from 6,000 to 10,000 slots
  – 10 drives shared between all users of the service
  – Additional 3 x T10K tape drives for PP
  – More when CASTOR service working
• STK Powderhorn
  – Decommissioned and removed


Storage commissioning

• Problems with March 06 procurement:
  – WD4000YR on Areca 1170, RAID 6
    • Many instances of multiple drive dropouts
    • Unwarranted drive dropouts followed by re-integration of the same drive
  – Drive electronics (ASIC) on 4000YR (400GB) units changed with no change of model designation
    • We got the updated units
  – Firmware updates to Areca cards did not solve the issues
  – WD5000YS (500GB) units swapped in by WD
    • Fixes most issues, but…
  – Status data and logs from drives showing several additional problems
    • Testing under high load to gather statistics (see the sketch below)
  – Production further delayed
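A minimal sketch of the sort of drive-status gathering described above, assuming Linux hosts with smartmontools installed. The device names and attribute list are illustrative, and drives behind the Areca controller may need smartctl's -d option; this is not the Tier-1's actual test harness.

    # Collect a few SMART error counters from a set of drives so that error
    # statistics can be compared across a load test.
    import subprocess

    DRIVES = ["/dev/sda", "/dev/sdb"]            # hypothetical device nodes
    WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
             "UDMA_CRC_Error_Count")

    for dev in DRIVES:
        table = subprocess.run(["smartctl", "-A", dev],
                               capture_output=True, text=True).stdout
        for line in table.splitlines():
            fields = line.split()
            if len(fields) > 1 and fields[1] in WATCH:
                print(dev, fields[1], fields[-1])   # attribute name, raw value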


Air-con issues

• Setup
  – 13 x 80kW units in lower machine room; several paired units work together
• Several ‘hot’ days (for the UK) in July
  – Sunday: dumped ~70 jobs
    • Alarm system failed to notify operators
    • Pre-emptive automatic shutdown not triggered
    • Ambient air temp reached >35°C, machine exhaust temperature >50°C!
    • HPC services not so lucky
  – Mid week 1: problems over two days
    • Attempts to cut load by suspending batch services to protect data services
    • Forced to dump 270 jobs
  – Mid week 2: 2 hot days predicted
    • Pre-emptive shutdown of batch services in lower machine room (see the sketch below)
    • No jobs lost, data services remained available
• Problem
  – High ambient air temperature tripped high-pressure cut-outs in refrigerant gas circuits
  – Cascade failure as individual air-con units work harder
  – Loss of control of machine room temperature
• Solutions
  – Sprinklers under units
    • Successful, but banned due to Health and Safety concerns
  – Up-rated refrigerant gas pressure settings to cope with higher ambient air temperature
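As an illustration of the pre-emptive batch shutdown mentioned above, here is a minimal sketch of a temperature watchdog that offlines Torque worker nodes when a room-temperature reading crosses a threshold. The sensor source, node names and threshold are assumptions for the example, not the Tier-1's actual procedure.

    # If the machine-room temperature crosses a threshold, mark Torque worker
    # nodes offline so no new jobs start while running work drains.
    import subprocess

    THRESHOLD_C = 30.0                           # assumed trip point
    NODES = ["wn0101", "wn0102"]                 # placeholder node names

    def room_temperature_c():
        # Placeholder: read from whatever environmental monitoring is in place.
        return float(open("/var/run/machine_room_temp").read())

    if room_temperature_c() > THRESHOLD_C:
        for node in NODES:
            # `pbsnodes -o` marks a node offline in Torque; running jobs continue.
            subprocess.run(["pbsnodes", "-o", node], check=False)
        print("Batch workers offlined pending air-con recovery")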


Operating systems

• Grid services, batch workers, service machines
  – SL3, mainly 3.0.3, 3.0.5, 4.2, all ix86
  – SL4 before Xmas
    • Considering x86_64
• Disk storage
  – SL4 migration in progress
• Tape systems
  – AIX: caches
  – Solaris: controller
  – SL3/4: CASTOR systems, newer caches
• Oracle systems
  – RHEL3/4
• Batch system
  – Torque/MAUI
    • Fair-shares, allocation by User Board


Databases

• 3D project
  – Participating since early days
    • Single Oracle server for testing
    • Successful
  – Production service
    • 2 x Oracle RAC clusters
      – Two servers per RAC
        » Redundant PSUs, hot-swap RAID1 system drives
      – Single SATA/FC data array
      – Some transfer rate issues
      – UPS to come


Storage Resource Management

• dCache
  – Performance issues
    • LAN performance very good
    • WAN performance and tuning problems
  – Stability issues
  – Now better:
    • Increased number of open file descriptors (see the note below)
    • Increased number of logins allowed
• ADS
  – In-house system, many years old
    • Will remain for some legacy services
• CASTOR2
  – Replace both dCache disk and tape SRMs for major data services
  – Replace T1 access to existing ADS services
  – Pre-production service for CMS
  – LSF for transfer scheduling
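The file-descriptor change above would normally be made in the dCache startup environment or /etc/security/limits.conf; purely as an illustration, a process can inspect and raise its own limit as below. The target value is an assumption, not a figure from the slide.

    # Inspect and raise the per-process open-file-descriptor limit.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"current limits: soft={soft}, hard={hard}")

    TARGET = 8192                    # illustrative target, not from the slide
    new_soft = min(TARGET, hard)     # an unprivileged process cannot exceed the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    print(f"soft limit raised to {new_soft}")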


Monitoring

• Nagios
  – Production service implemented
  – 3 servers (1 master + 2 slaves)
  – Almost all systems covered (see the example check below)
    • 600+
  – Replacing SURE
  – Add call-out facilities
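Site-specific checks plugged into a Nagios setup like this follow the standard plugin convention: exit 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN plus a one-line status message. A minimal sketch; the metric and thresholds are illustrative, not one of the Tier-1's actual checks.

    # Minimal Nagios-style check following the standard exit-code convention.
    # The metric (root filesystem usage) and thresholds are illustrative.
    import os, sys

    WARN_PCT, CRIT_PCT = 80, 90                  # assumed thresholds

    try:
        st = os.statvfs("/")
        used_pct = 100.0 * (1 - st.f_bavail / st.f_blocks)
    except OSError as err:
        print(f"UNKNOWN - {err}")
        sys.exit(3)

    if used_pct >= CRIT_PCT:
        print(f"CRITICAL - / is {used_pct:.1f}% full")
        sys.exit(2)
    elif used_pct >= WARN_PCT:
        print(f"WARNING - / is {used_pct:.1f}% full")
        sys.exit(1)
    print(f"OK - / is {used_pct:.1f}% full")
    sys.exit(0)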


Networking

• All systems have 1Gb/s connections
  – Except oldest fraction of the batch farm
• 10Gb/s links almost everywhere
  – 10Gb/s backbone within Tier-1
    • Complete November 06
    • Nortel 5530/5510 stacks
  – 10Gb/s link to RAL site backbone
    • 10Gb/s backbone links at RAL expected end November 06
    • 10Gb/s link to RAL Tier-2
  – 10Gb/s link to UK academic network SuperJanet5 (SJ5)
    • Expected in production by end of November 06
    • Firewall still an issue
      – Planned bypass for Tier1 data traffic as part of RAL<->SJ5 and RAL backbone connectivity developments
  – 10Gb/s OPN link to CERN active
    • September 06
    • Using pre-production SJ5 circuit
    • Production status at SJ5 handover


Security

• Notified of intrusion at Imperial College London
• Searched logs
  – Unauthorised use of account from suspect source
  – Evidence of harvesting password maps
  – No attempt to conceal activity
  – Unauthorised access to other sites
  – No evidence of root compromise
• Notified sites concerned
  – Incident widespread
• Passwords changed
  – All inactive accounts disabled
• Cleanup
  – Changed NIS to use shadow password map (see the check below)
  – Reinstall all interactive systems
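The move to a shadow password map matters because a plain `ypcat passwd` otherwise hands password hashes to any NIS client, which is what made the harvesting described above possible. A minimal sketch of a check that the cleanup took effect; the check is an illustration under that assumption, not the script actually used at the Tier-1.

    # Verify that the NIS passwd map no longer exposes password hashes: any
    # entry whose password field is not a shadow placeholder would still be
    # harvestable with `ypcat passwd`.
    import subprocess

    passwd_map = subprocess.run(["ypcat", "passwd"],
                                capture_output=True, text=True).stdout

    exposed = [entry.split(":")[0] for entry in passwd_map.splitlines()
               if entry.count(":") >= 6
               and entry.split(":")[1] not in ("x", "*", "!!")]

    if exposed:
        print("hashes still visible for:", ", ".join(sorted(exposed)))
    else:
        print("no password hashes exposed in the NIS passwd map")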

Questions?