
Northgrid Status

Alessandra Forti

Gridpp25 Ambleside

25 August 2010

Outline

• APEL pies
• Lancaster status
• Liverpool status
• Manchester status
• Sheffield status
• Conclusions

APEL pie (1)

APEL pie (2)

APEL pie (3)

Lancaster

• The new shared HPC facility at Lancaster.

– 1760 cores, delivering over 20k HEPSPEC06.

• Behind an LSF batch system.

• In a proper machine room (although that's had teething troubles).

– Added bonus of also providing a home for future GridPP kit.

• Access to a “Panasas shelf” for high-performance tarball and VO software areas.

• 260 TB Grid-only storage included in purchase.

• Roger Jones is managing director and we have admin access to the cluster, so we have a strong voice in this joint venture.

– But with root access comes root responsibility.

Lancaster

• New “HEC” facility almost ready.

– Final stages of acceptance testing.

– A number of gotchas and niggles are causing some delays, but this is expected at a facility of this scale.

• Under-specced fuses, over-specced UPS, overheating distribution boards, misbehaving switches, oh my!

• Site otherwise chugging along nicely (with some exceptions).

• New hardware to replace older servers

– Site needs a downtime in the near future to get these online and have a general clean up.

• Tendering for an additional 400 TB of storage.

Liverpool

• Addition to Cluster

– 32-bit cluster replaced by 9 x quad chassis (4 worker nodes in each), boosting 64-bit capability and reducing power consumption

– Each node has 2 x X5620 CPUs (4 cores each), 2 x hot-swap SATA disks, IPMI, and a redundant power supply

– Runs 64-bit SL5 with gLite 3.2

• Entire Cluster

– CPU:

• 6 x Quad core worker nodes (56 cores)

• 16 x Quad chassis, 4 WNs each (512 cores)

• 3 GB RAM per core

• 8214.44 HEPSPEC06 (a quick arithmetic check follows this list)

• Storage: 286 Terabytes

• Plans: migrate from VMware to KVM (lcg-CE, CREAM-CE, site BDII, MON box ...)
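As a quick cross-check of the figures above, the totals can be reproduced from the per-node numbers. A minimal sketch; the 568-core total, total RAM and per-core HEPSPEC06 values are derived here rather than quoted on the slide:

```python
# Liverpool cluster totals derived from the per-node figures listed above.
# These are derived numbers, not figures quoted on the slide.
older_wn_cores = 56                           # older WN group, core count as listed
chassis_cores = 16 * 4 * 8                    # 16 quad chassis, 4 WNs each, 2 x quad-core CPUs per WN
total_cores = older_wn_cores + chassis_cores  # 56 + 512 = 568
total_ram_gb = total_cores * 3                # 3 GB RAM per core
hs06_per_core = 8214.44 / total_cores         # ~14.5 HEPSPEC06 per core
print(total_cores, total_ram_gb, round(hs06_per_core, 2))
```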

Liverpool: problem

• https://gus.fzk.de/ws/ticket_info.php?ticket=61224

• The site shows as “not available” in GStat because GStat/Nagios does not recognise “Red Hat Enterprise Server” as an OS name

• The OS was correctly advertised, following the published procedure How_to_publish_the_OS_name

• The ticket has been reassigned to the GStat support unit, but the problem has affected the site's Nagios tests.
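For reference, the OS name is published through the GLUE 1.3 GlueSubCluster attributes, so what a site actually advertises can be checked by querying its BDII directly. A minimal sketch assuming the ldap3 Python package; the BDII hostname is a placeholder, not the Liverpool host:

```python
# Query a site BDII for the advertised operating-system attributes (GLUE 1.3).
# "site-bdii.example.ac.uk" is a placeholder; replace with the real site BDII.
from ldap3 import Server, Connection, ALL

server = Server("ldap://site-bdii.example.ac.uk:2170", get_info=ALL)
conn = Connection(server, auto_bind=True)   # BDIIs allow anonymous binds

conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueSubCluster)",
    attributes=["GlueHostOperatingSystemName",
                "GlueHostOperatingSystemRelease",
                "GlueHostOperatingSystemVersion"],
)
for entry in conn.entries:
    print(entry.entry_dn)
    print(entry.GlueHostOperatingSystemName,
          entry.GlueHostOperatingSystemRelease,
          entry.GlueHostOperatingSystemVersion)
```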

Manchester: new hardware

• Procurement at commissioning stage

– New hardware will replace half of the current cluster

• Computing power

– 20 quad chassis

– 2 x 6-core CPUs per motherboard

– 13.7 HEPSPEC06 per core

– 4GB memory per core

– 125 GB disk space per core

– 2 Gb/s bonded link per motherboard

– Total: 960 cores = 13.15k HEPSPEC06 (arithmetic check below)

• Storage

– 9 x 36-bay units

– (30 + 2 + 1) x 2 TB disks per RAID set (30 data disks + 2 parity disks + 1 hot spare)

– 2 x 250 GB disks for the OS

– Total: 540 TB usable
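The stated totals follow from the per-unit figures; a minimal arithmetic check, with all inputs taken from the lists above:

```python
# Manchester procurement: reproduce the stated totals from the per-unit figures.
# Compute
chassis = 20
nodes_per_chassis = 4
cores_per_node = 2 * 6                                 # 2 x 6-core CPUs per motherboard
cores = chassis * nodes_per_chassis * cores_per_node   # 960
hepspec = cores * 13.7                                 # ~13,152 HEPSPEC06 ("13.15k")

# Storage
units = 9
data_disks_per_unit = 30                               # 30 data + 2 parity + 1 hot spare per unit
usable_tb = units * data_disks_per_unit * 2            # 2 TB disks -> 540 TB usable
print(cores, hepspec, usable_tb)
```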

Manchester: new hardware

• Computing nodes have arrived and are currently running on-site soak tests

– Testing the opportunity to use ext4 as the file system rather than ext3 (an illustrative iozone invocation is sketched below)

• Initial tests with iozone on old nodes show a visible improvement, especially in multithreaded tests

• Single-threaded performance is better only for writes

• Need to start testing new nodes

• Storage will be delivered this week

– The RAID card had a disk-detection problem.

– The issue is now solved by a new firmware version.

• Rack switches undergoing reconfiguration to allow

– 4x2Gb/s bonded links from quad chassis

– 1x4Gb/s bonded link from storage units

– 8 x 1 Gb/s uplink to the main Cisco
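For the ext3 vs ext4 comparison mentioned above, something along these lines can drive iozone in both single-threaded and multithreaded modes on a scratch mount. A minimal sketch only: the mount point, file sizes, record size and thread count are illustrative assumptions, not the parameters used at Manchester.

```python
# Simple single- vs multi-threaded iozone write/read comparison on a test
# mount point (formatted ext3 or ext4). All parameters are illustrative.
import subprocess

MOUNT = "/mnt/fstest"   # hypothetical scratch mount point

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Single-threaded sequential write/read: 2 GB file, 1 MB records.
run(["iozone", "-i", "0", "-i", "1", "-s", "2g", "-r", "1m",
     "-f", f"{MOUNT}/iozone.single"])

# Multithreaded throughput mode: 8 threads, one 2 GB file per thread.
files = [f"{MOUNT}/iozone.t{n}" for n in range(8)]
run(["iozone", "-i", "0", "-i", "1", "-s", "2g", "-r", "1m",
     "-t", "8", "-F", *files])
```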

Manchester: general news

• Current stability

– Consistently delivered a high number of CPU hours

– Current storage is stable, although slowed down by an SL5(.3?)/XFS problem that causes bursts of load on the machines

• Evaluating how to tackle the problem for the new storage.

• Management

– Stefan Söldner-Rembold, Mike Seymour and Un-ki Yang will take over the management of the Tier-2 from Roger Barlow.

Sheffield: major cluster upgrade

• We have built a new cluster in the Physics Department with 24/7 access. It is a joint GridPP/local-group cluster with a common Torque server

• Storage

– Storage nodes running DPM 1.7.4, recently upgraded to SL5.5. Tests show the latest SL5 kernels no longer exhibit the XFS bug

– DPM head node with 8 cores and 16 GB of RAM

– 200 TB of disk space has been deployed, covering the pledge for 2010 according to the new GridPP requirements

– Software RAID5, no hardware RAID controllers (an illustrative mdadm sketch follows this list)

– All disk pools assembled in Sheffield from 2 TB Seagate Barracuda disks, plus one 2 TB Western Digital

– 5 x 16-bay units with 2 filesystems each, 2 x 24-bay units with 2 filesystems each

– Cold spare unit on standby near each server

– 95% of disk space reserved for ATLAS

– Plan to add 50 TB late in the autumn
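Since the pools use Linux software RAID rather than hardware controllers, each pool filesystem is assembled with mdadm, formatted, and then registered with DPM. A minimal sketch of one such array; the device names, the 8-disk split per filesystem and the mount point are illustrative assumptions, not the site's exact layout:

```python
# Illustrative software-RAID5 pool setup (run as root). Device names, the
# 8-disk-per-filesystem split and the mount point are assumptions only.
import subprocess

disks = [f"/dev/sd{c}" for c in "bcdefghi"]   # eight 2 TB data disks (hypothetical)

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Assemble the RAID5 array from the data disks.
run(["mdadm", "--create", "/dev/md0", "--level=5",
     f"--raid-devices={len(disks)}", *disks])

# Format and mount it where it will later be registered as a DPM pool filesystem.
run(["mkfs.xfs", "/dev/md0"])
run(["mount", "/dev/md0", "/dpmpool/fs01"])   # hypothetical pool directory
```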

Sheffield: major cluster upgrade

• Infrastructure

– Additional 32A ring mains added to the machine room in the physics department

– Fibre links connecting servers in Physics to WNs in CICS

– Old 5 kW aircon replaced with new 10 kW units (3 x 10 kW aircon units)

– Dedicated WAN link to cluster

• Torque Server

– Accepts jobs from grid CE and local cluster

– Sends jobs to all WNs

– Hosts DNS server

• CE

– SL4.8, 4 x single-core AMD Opteron 850 processors, 8 GB of RAM, redundant power supply, 72 GB SCSI disks in a RAID1

– MONBOX and BDII

Sheffield: major cluster upgrade

• WN

– 50 WNs in the Physics Department (32 from the local HEP cluster + 18 new), 5 Gb backbone

– Phenom 3200 MHz x86_64, 8 GB of RAM, 140 GB of disk per 4 cores, 11.96 HEPSPEC06/core

– 102 old WNs in CICS, 1 Gb backbone

– 204 single-core 2.4 GHz Opterons (2 GB), 4 GB of RAM, 72 GB local disk per 2 cores; 7.9 HEPSPEC06/core; connected to the servers via fibre link (a rough capacity estimate follows below)

– Jobs requiring greater network bandwidth are directed to the WNs with the better backbone

– The software server with a 1 TB disk (RAID1) and the squid server were moved from CICS

• Cluster availability and reliability in July were 100%

• Sheffield is active in ATLAS production and user analysis
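For a rough sense of the compute capacity these worker nodes represent, the per-core figures above can be summed. A minimal sketch; the 4-cores-per-new-WN count is inferred from the "140 GB of disk per 4 cores" figure, and the resulting totals are estimates rather than numbers quoted by the site:

```python
# Rough Sheffield compute-capacity estimate from the per-core figures above.
# The 4-cores-per-new-WN count is inferred; the totals are estimates only.
new_cores = 50 * 4          # 50 Phenom WNs, assumed 4 cores each
old_cores = 204             # 204 single-core Opterons in CICS
total_hs06 = new_cores * 11.96 + old_cores * 7.9
print(new_cores + old_cores, "cores,", round(total_hs06), "HEPSPEC06")
```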

Conclusion

• Northgrid has been pretty stable, steadily crunching CPU hours and data for the past 4 months.

• Sites have acquired, or are in the process of acquiring, new hardware, rejuvenating the CPUs and increasing the storage

• Both the CPU and storage MoU requirements for Northgrid should be exceeded in the near future.