Outline
• APEL pies
• Lancaster status
• Liverpool status
• Manchester status
• Sheffield status
• Conclusions
Lancaster
• The new shared HPC facility at Lancaster.
– 1760 cores, delivering over 20k HEPSPEC06
• Behind an LSF batch system.
• In a proper machine room (although that's had teething troubles).
– Added bonus of also providing a home for future GridPP kit.
• Access to a “Panasas shelf” for high-performance tarball and VO software areas.
• 260 TB Grid-only storage included in purchase.
• Roger Jones is managing director and we have admin access to the cluster, so we have a strong voice in this joint venture.
– But with root access comes root responsibility.
Lancaster
• New “HEC” facility almost ready.
– Final stages of acceptance testing.
– A number of gotchas and niggles are causing some delays, but this is expected at a facility of this scale.
• Underspecced fuses, overspecced UPS, overheating distribution boards, misbehaving switches, oh my!
• Site otherwise chugging along nicely (with some exceptions).
• New hardware to replace older servers
– Site needs a downtime in the near future to get these online and have a general clean up.
• Tendering for an additional 400 TB of storage
Liverpool:
• Addition to Cluster
– 32-bit cluster replaced by 9 x quad chassis (4 worker nodes each), boosting 64-bit capability and reducing power consumption
– Each node has 2 x X5620 CPUs (4 core), 2 x hot swap SATA, IPMI, redundant power supply
– Runs 64-bit SL5, gLite 3.2
• Entire Cluster
– CPU:
• 6 x Quad core worker nodes (56 cores)
• 16 x Quad chassis, 4 WNs each (512 cores)
• 3 GB RAM per core
• 8214.44 HEPSPEC06
• Storage: 286 Terabytes
• Plans: Migrate from VMWare to KVM (lcg-CE, CREAM-CE, Site BDII, MON Box ...)
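As a sketch of what the KVM side of such a migration involves, the snippet below builds a minimal libvirt domain definition for one of the service nodes. The guest name, memory, vCPU count and disk path are illustrative assumptions; a real CREAM-CE guest would also need network interfaces, a console, and so on.

```python
import xml.etree.ElementTree as ET

def libvirt_domain_xml(name, memory_mib, vcpus, disk_path):
    """Build a minimal KVM/libvirt domain definition (illustrative only)."""
    dom = ET.Element("domain", type="kvm")
    ET.SubElement(dom, "name").text = name
    ET.SubElement(dom, "memory", unit="MiB").text = str(memory_mib)
    ET.SubElement(dom, "vcpu").text = str(vcpus)
    os_el = ET.SubElement(dom, "os")
    ET.SubElement(os_el, "type", arch="x86_64").text = "hvm"
    devices = ET.SubElement(dom, "devices")
    disk = ET.SubElement(devices, "disk", type="file", device="disk")
    ET.SubElement(disk, "source", file=disk_path)
    ET.SubElement(disk, "target", dev="vda", bus="virtio")
    return ET.tostring(dom, encoding="unicode")

# Hypothetical CREAM-CE guest; the resulting XML would be fed to `virsh define`.
print(libvirt_domain_xml("cream-ce", 4096, 2,
                         "/var/lib/libvirt/images/cream-ce.img"))
```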
Liverpool: problem
• https://gus.fzk.de/ws/ticket_info.php?ticket=61224
• Site appears as “not available” in GStat because GStat/Nagios doesn't recognise Red Hat Enterprise Server as an OS
• The OS was correctly advertised, following the published procedure How_to_publish_the_OS_name
• The ticket has been reassigned to the GStat support unit, but the problem has affected the site's Nagios tests.
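The mismatch boils down to a string comparison: the OS name the site publishes in GlueHostOperatingSystemName has to be one the probe recognises. The recognised set below is an illustrative assumption, not GStat's actual list, but it shows the failure mode:

```python
# Hypothetical sketch of the GStat-side check.  The recognised set is
# an assumption for illustration, not GStat's real whitelist.
RECOGNISED_OS = {"ScientificSL", "ScientificCERNSLC", "CentOS"}

def os_recognised(published_name: str) -> bool:
    """Return True if the probe knows the advertised OS name."""
    return published_name in RECOGNISED_OS

# Liverpool advertised its OS correctly per the GLUE procedure, but the
# probe did not know the name, so the site showed as "not available".
print(os_recognised("RedHatEnterpriseServer"))  # False
print(os_recognised("ScientificSL"))            # True
```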
Manchester: new hardware
• Procurement at commissioning stage
– New hardware will replace half of the current cluster
• Computing power
– 20 quad chassis
– 2x6 core per motherboard
– 13.7 HEPSPEC06 per core
– 4GB memory per core
– 125 GB disk space per core
– 2 Gb/s bonded link per motherboard
– Total: 960 cores = 13.15k HEPSPEC06
• Storage
– 9 x 36-bay units
– 30+2+1 x 2 TB per RAID set (30 data disks + 2 parity disks + 1 hot spare)
– 2x250GB for the OS
– Total: 540 TB usable
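The quoted totals can be cross-checked from the per-unit figures above. Two layout assumptions are made explicit in the code: a “quad chassis” holds 4 motherboards, and only the 30 data disks per RAID set count towards usable storage.

```python
# Sanity check of the Manchester procurement totals (assumptions:
# 4 motherboards per quad chassis; only data disks count as usable).
chassis, boards_per_chassis, cores_per_board = 20, 4, 2 * 6
cores = chassis * boards_per_chassis * cores_per_board
hepspec = cores * 13.7  # HEPSPEC06 per core, from the slide

units, data_disks_per_unit, disk_tb = 9, 30, 2
usable_tb = units * data_disks_per_unit * disk_tb

print(cores, round(hepspec / 1000, 2), usable_tb)  # 960 13.15 540
```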
Manchester: new hardware
• Computing nodes arrived and currently running on site soak tests
– Testing the opportunity to use ext4 as the file system rather than ext3
• Initial tests with IOzone on old nodes show a visible improvement, especially in multithreaded tests
• Single thread is better in writing only
• Need to start testing new nodes
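IOzone itself does this kind of comparison properly; as a minimal stand-in, the sketch below times the same total write workload with one thread versus several, which is the single- vs multithreaded contrast the tests above look at. File sizes here are tiny, illustrative values.

```python
import os, tempfile, threading, time

def write_file(path, mib=8):
    # Write `mib` MiB and force it to disk, roughly one write pass.
    block = b"\0" * (1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(mib):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())

def timed(nthreads, tmpdir):
    # Each thread writes its own file; return the wall-clock time.
    threads = [threading.Thread(target=write_file,
                                args=(os.path.join(tmpdir, f"t{i}.dat"),))
               for i in range(nthreads)]
    t0 = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0

with tempfile.TemporaryDirectory() as d:
    print(f"1 thread : {timed(1, d):.3f}s")
with tempfile.TemporaryDirectory() as d:
    print(f"4 threads: {timed(4, d):.3f}s")
```

On ext4 versus ext3 the interesting number is how the multithreaded time scales; a real benchmark would repeat runs and drop caches between them.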
• Storage will be delivered this week
– Raid card had a disk detection problem.
– Issue is now solved by a new version of the firmware.
• Rack switches undergoing reconfiguration to allow
– 4x2Gb/s bonded links from quad chassis
– 1x4Gb/s bonded link from storage units
– 8 x 1 Gb/s uplink to the main Cisco
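A quick oversubscription check follows from these link speeds. The per-link rates come from the slides; the number of chassis and storage units hanging off each rack switch is an assumption for illustration.

```python
# Back-of-envelope oversubscription check for one rack switch.
chassis_per_switch, storage_per_switch = 4, 1  # assumed counts per switch
downlink_gbps = (chassis_per_switch * 4 * 2    # 4 x 2 Gb/s bonds per chassis
                 + storage_per_switch * 1 * 4) # 1 x 4 Gb/s bond per storage unit
uplink_gbps = 8 * 1                            # 8 x 1 Gb/s to the main Cisco

print(f"downlink {downlink_gbps} Gb/s, uplink {uplink_gbps} Gb/s, "
      f"oversubscription {downlink_gbps / uplink_gbps:.1f}:1")
```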
Manchester: general news
• Current stability
– Consistently delivered a high number of CPU hours
– Current storage is stable although slowed down by SL5(.3?)/XFS problem that causes bursts of load on the machine
• Evaluating how to tackle the problem for the new storage.
• Management
– Stefan Soldner-Rembold, Mike Seymour and Un-ki Yang will take over the management of the Tier-2 from Roger Barlow.
Sheffield: major cluster upgrade
• We have built a new cluster in the Physics Department with 24/7 access. It is a joint GridPP/local-group cluster with a common Torque server
• Storage
– Storage nodes running DPM 1.7.4, recently upgraded to SL5.5. Tests show the latest SL5 kernels no longer exhibit the XFS bug
– DPM head node 8 cores and 16 GB RAM
– 200 TB of disk space has been deployed – covering the pledge for 2010 according to the new GridPP requirements
– SW RAID5 (no raid controllers)
– All disk pools assembled in Sheffield from 2 TB Seagate Barracuda disks, plus one 2 TB Western Digital
– 5 x 16-bay units with 2 filesystems each, 2 x 24-bay units with 2 filesystems each
– Cold spare unit on standby near each server
– 95% of disk space reserved for ATLAS
– Plan to add 50 TB late in the autumn
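The 200 TB figure is roughly consistent with software RAID5 over those units: a RAID5 set gives up one disk's capacity to parity. The split of each unit into two equal arrays is an assumption suggested by the “2 filesystems” layout, and the result is raw capacity before hot spares and filesystem overhead.

```python
# Usable capacity of a software RAID5 set: one disk's worth goes to parity.
def raid5_usable_tb(disks: int, disk_tb: int = 2) -> int:
    return (disks - 1) * disk_tb

# Assumed layout: each 16-bay unit split into two 8-disk arrays,
# each 24-bay unit into two 12-disk arrays.
total = 5 * 2 * raid5_usable_tb(8) + 2 * 2 * raid5_usable_tb(12)
print(total)  # 228 TB raw, before hot spares and fs overhead
```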
Sheffield: major cluster upgrade
• Infrastructure
– Additional 32A ring mains added to the machine room in the physics department
– Fibre links connecting servers in Physics to WNs in CICS
– Old 5 kW aircon replaced with new 10 kW units (3 x 10 kW aircon units)
– Dedicated WAN link to cluster
• Torque Server
– Accepts jobs from grid CE and local cluster
– Sends jobs to all WNs
– Hosts DNS server
• CE
– SL4.8, 4 single-core AMD Opteron 850 processors, 8 GB of RAM, redundant power supply, 72 GB SCSI disks in RAID1
– MONBOX and BDII
Sheffield: major cluster upgrade
• WN
– 50 WNs in Physics Department (32 from local hep cluster + 18 new), 5 Gb backbone
– Phenom 3200 MHz x86_64, 8 GB of RAM, 140 GB / 4 cores, 11.96 HepSpec/core
– 102 old WNs in CICS, 1 Gb backbone
– 204 single-core 2.4 GHz Opterons (2 GB), 4 GB of RAM, 72 GB local disk per 2 cores; 7.9 HepSpec/core; connected to the servers via fibre link
– Jobs requiring greater network bandwidth directed to WNs with better backbone
– software server with 1 TB disk (RAID1) and squid server were moved from CICS
• Cluster availability and reliability in July were 100%
• Sheffield is active in ATLAS production and user analysis
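The bandwidth-aware routing described above can be sketched as a simple pool-selection rule: network-heavy jobs go to the Physics WNs on the 5 Gb backbone, everything else can use the CICS WNs on 1 Gb. The pool names and the selection function are hypothetical; in practice this mapping would be expressed via Torque node properties and queue configuration rather than Python.

```python
# Illustrative pools: backbone bandwidth in Gb/s, per the slide.
POOLS = {"physics": 5.0, "cics": 1.0}

def route(required_gbps: float) -> str:
    """Pick the least-capable pool that still meets the requirement,
    keeping the fast backbone free for jobs that really need it."""
    candidates = [(bw, name) for name, bw in POOLS.items()
                  if bw >= required_gbps]
    if not candidates:
        raise ValueError("no pool fast enough")
    return min(candidates)[1]

print(route(0.5))  # cics
print(route(2.0))  # physics
```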
Conclusion
• NorthGrid has been pretty stable, steadily crunching CPU hours and data for the past 4 months.
• Sites have received, or are in the process of receiving, new hardware, rejuvenating the CPUs and increasing the storage.
• Both the CPU and storage MoU requirements for NorthGrid should be comfortably exceeded in the near future.