Outline
• APEL pies
• Lancaster status
• Liverpool status
• Manchester status
• Sheffield status
• Conclusions
Lancaster
• The new shared HPC facility at Lancaster.
– 1760 cores, delivering over 20k HEPSPEC06
• Behind an LSF batch system.
• In a proper machine room (although that's had teething troubles).
– Added bonus of also providing a home for future GridPP kit.
• Access to a “Panasas shelf” for high-performance tarball and VO software areas.
• 260 TB Grid-only storage included in purchase.
• Roger Jones is managing director and we have admin access to the cluster, so we have a strong voice in this joint venture.
– But with root access comes root responsibility.
Lancaster
• New “HEC” facility almost ready.
– Final stages of acceptance testing.
– A number of gotchas and niggles are causing some delays, but this is expected at a facility of this scale.
• Underspecced fuses, overspecced UPS, overheating distribution boards, misbehaving switches, oh my!
• Site otherwise chugging along nicely (with some exceptions).
• New hardware to replace older servers
– Site needs a downtime in the near future to get these online and have a general clean up.
• Tendering for an additional 400 TB of storage
Liverpool:
• Addition to Cluster
– 32-bit cluster replaced by 9 x quad chassis (4 worker nodes each), boosting 64-bit capability and reducing power consumption
– Each node has 2 x X5620 CPUs (4 core), 2 x hot swap SATA, IPMI, redundant power supply
– Runs 64-bit SL5, gLite 3.2
• Entire Cluster
– CPU:
• 6 x Quad core worker nodes (56 cores)
• 16 x Quad chassis, 4 WNs each (512 cores)
• 3 GB RAM per core
• 8214.44 HEPSPEC06
• Storage: 286 Terabytes
• Plans: Migrate from VMWare to KVM (lcg-CE, CREAM-CE, Site BDII, MON Box ...)
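As a sketch of what the KVM side of such a migration involves, the snippet below builds a minimal libvirt domain definition for one of the service nodes. The guest name, memory, vCPU count and disk path are illustrative assumptions; a real CREAM-CE guest would also need network interfaces, a console, and so on.

```python
import xml.etree.ElementTree as ET

def libvirt_domain_xml(name, memory_mib, vcpus, disk_path):
    """Build a minimal KVM/libvirt domain definition (illustrative only)."""
    dom = ET.Element("domain", type="kvm")
    ET.SubElement(dom, "name").text = name
    ET.SubElement(dom, "memory", unit="MiB").text = str(memory_mib)
    ET.SubElement(dom, "vcpu").text = str(vcpus)
    os_el = ET.SubElement(dom, "os")
    ET.SubElement(os_el, "type", arch="x86_64").text = "hvm"
    devices = ET.SubElement(dom, "devices")
    disk = ET.SubElement(devices, "disk", type="file", device="disk")
    ET.SubElement(disk, "source", file=disk_path)
    ET.SubElement(disk, "target", dev="vda", bus="virtio")
    return ET.tostring(dom, encoding="unicode")

# Hypothetical CREAM-CE guest; the resulting XML would be fed to `virsh define`.
print(libvirt_domain_xml("cream-ce", 4096, 2,
                         "/var/lib/libvirt/images/cream-ce.img"))
```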
Liverpool: problem
• https://gus.fzk.de/ws/ticket_info.php?ticket=61224
• Site appears as “not available” in GStat because GStat/Nagios doesn't recognise Red Hat Enterprise Server as an OS
• The OS was correctly advertised, following the published procedure How_to_publish_the_OS_name
• The ticket has been reassigned to the GStat support unit, but the problem has affected the site's Nagios tests.
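The mismatch boils down to a string comparison: the OS name the site publishes in GlueHostOperatingSystemName has to be one the probe recognises. The recognised set below is an illustrative assumption, not GStat's actual list, but it shows the failure mode:

```python
# Hypothetical sketch of the GStat-side check.  The recognised set is
# an assumption for illustration, not GStat's real whitelist.
RECOGNISED_OS = {"ScientificSL", "ScientificCERNSLC", "CentOS"}

def os_recognised(published_name: str) -> bool:
    """Return True if the probe knows the advertised OS name."""
    return published_name in RECOGNISED_OS

# Liverpool advertised its OS correctly per the GLUE procedure, but the
# probe did not know the name, so the site showed as "not available".
print(os_recognised("RedHatEnterpriseServer"))  # False
print(os_recognised("ScientificSL"))            # True
```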
Manchester: new hardware
• Procurement at commissioning stage
– New hardware will replace half of the current cluster
• Computing power
– 20 quad chassis
– 2x6 core per motherboard
– 13.7 HEPSPEC06 per core
– 4GB memory per core
– 125 GB disk space per core
– 2 Gb/s bonded link per motherboard
– Total: 960 cores = 13.15k HEPSPEC06
• Storage
– 9 x 36-bay units
– 30+2+1 x 2 TB per RAID set (30 data disks + 2 parity disks + 1 hot spare)
– 2x250GB for the OS
– Total: 540 TB usable
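The quoted totals can be cross-checked from the per-unit figures above. Two layout assumptions are made explicit in the code: a “quad chassis” holds 4 motherboards, and only the 30 data disks per RAID set count towards usable storage.

```python
# Sanity check of the Manchester procurement totals (assumptions:
# 4 motherboards per quad chassis; only data disks count as usable).
chassis, boards_per_chassis, cores_per_board = 20, 4, 2 * 6
cores = chassis * boards_per_chassis * cores_per_board
hepspec = cores * 13.7  # HEPSPEC06 per core, from the slide

units, data_disks_per_unit, disk_tb = 9, 30, 2
usable_tb = units * data_disks_per_unit * disk_tb

print(cores, round(hepspec / 1000, 2), usable_tb)  # 960 13.15 540
```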
Manchester: new hardware
• Computing nodes arrived and currently running on site soak tests
– Testing the opportunity to use ext4 as the file system rather than ext3
• Initial tests with IOzone on old nodes show a visible improvement, especially in multithreaded tests
• Single thread is better in writing only
• Need to start testing new nodes
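IOzone itself does this kind of comparison properly; as a minimal stand-in, the sketch below times the same total write workload with one thread versus several, which is the single- vs multithreaded contrast the tests above look at. File sizes here are tiny, illustrative values.

```python
import os, tempfile, threading, time

def write_file(path, mib=8):
    # Write `mib` MiB and force it to disk, roughly one write pass.
    block = b"\0" * (1024 * 1024)
    with open(path, "wb") as f:
        for _ in range(mib):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())

def timed(nthreads, tmpdir):
    # Each thread writes its own file; return the wall-clock time.
    threads = [threading.Thread(target=write_file,
                                args=(os.path.join(tmpdir, f"t{i}.dat"),))
               for i in range(nthreads)]
    t0 = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - t0

with tempfile.TemporaryDirectory() as d:
    print(f"1 thread : {timed(1, d):.3f}s")
with tempfile.TemporaryDirectory() as d:
    print(f"4 threads: {timed(4, d):.3f}s")
```

On ext4 versus ext3 the interesting number is how the multithreaded time scales; a real benchmark would repeat runs and drop caches between them.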
• Storage will be delivered this week
– Raid card had a disk detection problem.
– Issue is now solved by a new version of the firmware.
• Rack switches undergoing reconfiguration to allow
– 4x2Gb/s bonded links from quad chassis
– 1x4Gb/s bonded link from storage units
– 8 x 1 Gb/s uplink to the main Cisco
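A quick oversubscription check follows from these link speeds. The per-link rates come from the slides; the number of chassis and storage units hanging off each rack switch is an assumption for illustration.

```python
# Back-of-envelope oversubscription check for one rack switch.
chassis_per_switch, storage_per_switch = 4, 1  # assumed counts per switch
downlink_gbps = (chassis_per_switch * 4 * 2    # 4 x 2 Gb/s bonds per chassis
                 + storage_per_switch * 1 * 4) # 1 x 4 Gb/s bond per storage unit
uplink_gbps = 8 * 1                            # 8 x 1 Gb/s to the main Cisco

print(f"downlink {downlink_gbps} Gb/s, uplink {uplink_gbps} Gb/s, "
      f"oversubscription {downlink_gbps / uplink_gbps:.1f}:1")
```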
Manchester: general news
• Current stability
– Consistently delivered a high number of CPU hours
– Current storage is stable although slowed down by SL5(.3?)/XFS problem that causes bursts of load on the machine
• Evaluating how to tackle the problem for the new storage.
• Management
– Stefan Soldner-Rembold, Mike Seymour and Un-ki Yang will take over the management of the Tier-2 from Roger Barlow.
Sheffield: major cluster upgrade
• We have built a new cluster in the Physics Department with 24/7 access. It is a joint GridPP/local-group cluster with a common Torque server
• Storage
– Storage nodes running DPM 1.7.4, recently upgraded to SL5.5. Tests show the latest SL5 kernels no longer exhibit the XFS bug
– DPM head node 8 cores and 16 GB RAM
– 200 TB of disk space has been deployed – covering the pledge for 2010 according to the new GridPP requirements
– SW RAID5 (no raid controllers)
– All disk pools assembled in Sheffield from 2 TB Seagate Barracuda disks, plus one 2 TB Western Digital
– 5 x 16-bay units with 2 filesystems each, 2 x 24-bay units with 2 filesystems each
– Cold spare unit on standby near each server
– 95% of disk space reserved for ATLAS
– Plan to add 50 TB late in the autumn
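The 200 TB figure is roughly consistent with software RAID5 over those units: a RAID5 set gives up one disk's capacity to parity. The split of each unit into two equal arrays is an assumption suggested by the “2 filesystems” layout, and the result is raw capacity before hot spares and filesystem overhead.

```python
# Usable capacity of a software RAID5 set: one disk's worth goes to parity.
def raid5_usable_tb(disks: int, disk_tb: int = 2) -> int:
    return (disks - 1) * disk_tb

# Assumed layout: each 16-bay unit split into two 8-disk arrays,
# each 24-bay unit into two 12-disk arrays.
total = 5 * 2 * raid5_usable_tb(8) + 2 * 2 * raid5_usable_tb(12)
print(total)  # 228 TB raw, before hot spares and fs overhead
```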
Sheffield: major cluster upgrade
• Infrastructure
– Additional 32A ring mains added to the machine room in the physics department
– Fibre links connecting servers in Physics to WNs in CICS
– Old 5 kW aircon replaced with new 10 kW units (3 x 10 kW aircon units)
– Dedicated WAN link to cluster
• Torque Server
– Accepts jobs from grid CE and local cluster
– Sends jobs to all WNs
– Hosts DNS server
• CE
– SL4.8, 4 single-core AMD Opteron 850 processors, 8 GB of RAM, redundant power supply, 72 GB SCSI disks in RAID1
– MONBOX and BDII
Sheffield: major cluster upgrade
• WN
– 50 WNs in Physics Department (32 from local hep cluster + 18 new), 5 Gb backbone
– Phenom 3200 MHz x86_64, 8 GB of RAM, 140 GB / 4 cores, 11.96 HepSpec/core
– 102 old WNs in CICS, 1 Gb backbone
– 204 single-core 2.4 GHz Opterons (2 GB), 4 GB of RAM, 72 GB local disk per 2 cores; 7.9 HepSpec/core; connected to the servers via fibre link
– Jobs requiring greater network bandwidth directed to WNs with better backbone
– software server with 1 TB disk (RAID1) and squid server were moved from CICS
• Cluster availability and reliability in July were 100%
• Sheffield is active in ATLAS production and user analysis
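The bandwidth-aware routing described above can be sketched as a simple pool-selection rule: network-heavy jobs go to the Physics WNs on the 5 Gb backbone, everything else can use the CICS WNs on 1 Gb. The pool names and the selection function are hypothetical; in practice this mapping would be expressed via Torque node properties and queue configuration rather than Python.

```python
# Illustrative pools: backbone bandwidth in Gb/s, per the slide.
POOLS = {"physics": 5.0, "cics": 1.0}

def route(required_gbps: float) -> str:
    """Pick the least-capable pool that still meets the requirement,
    keeping the fast backbone free for jobs that really need it."""
    candidates = [(bw, name) for name, bw in POOLS.items()
                  if bw >= required_gbps]
    if not candidates:
        raise ValueError("no pool fast enough")
    return min(candidates)[1]

print(route(0.5))  # cics
print(route(2.0))  # physics
```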
Conclusion
• NorthGrid has been pretty stable, steadily crunching CPU hours and data for the past 4 months.
• Sites have received, or are in the process of receiving, new hardware, rejuvenating the CPUs and increasing the storage.
• Both the CPU and storage MoU requirements for NorthGrid should be comfortably exceeded in the near future.