UKI-SouthGrid Overview and Oxford Status Report
Pete Gronbech, SouthGrid Technical Coordinator
GridPP 24 - RHUL, 15th April 2010
UK Tier 2 reported CPU – Historical View to present
[Chart: CPU delivered per month by the four UK Tier 2s (UK-London-Tier2, UK-NorthGrid, UK-ScotGrid, UK-SouthGrid), Jan-09 to Mar-10, in K SPEC int 2000 (kSI2K) hours.]
SouthGrid Sites Accounting as reported by APEL
[Chart: CPU delivered per month by each SouthGrid site (JET, BHAM, BRIS, CAM, OX, RALPPD), Jan-09 to Mar-10, in K SPEC int 2000 (kSI2K) hours.]
Sites upgrading to SL5, and recalibration of published SI2K values.
RALPP seems low, even after my compensation for their publishing 1000 instead of 2500.
Site Resources

Site         HEPSPEC06   CPU (kSI2K, converted     Storage (TB)
                         from HEPSPEC06)
EDFA-JET          1772        442                      1.5
Birmingham        3344        836                    114
Bristol           1836        459                    110
Cambridge         1772        443                    140
Oxford            3332        833                    160
RALPPD           12928       3232                    633
Totals           24984       6246                   1158.5
JET
• Stable operation (SL5 WNs); quiet period over Christmas.
• Could handle more opportunistic LHC work.
• Resources: 1772 HS06, 1.5 TB.
Birmingham
• WNs have been SL5 since Christmas. We have an ARGUS server (no SCAS). glexec_wn is installed on the local worker nodes (not the shared cluster), so we *should* be able to run multi-user pilot jobs, but this has yet to be tested.
• We also have a CREAM CE in production, but it does not use the ARGUS server yet.
• Some problems on the HPC: GPFS has been unreliable.
• Recent increases in CPU to 3344 HS06 and disk to 114 TB.
Bristol
• SL4 WN & ce01 retired, VMs ce03+04 brought online, giving 132 SL5 jobslots
• Metson Cluster also being used to evaluate Hadoop• More than doubled in CPU size to 1836 HS06 • Disk at 110 TB• Storm will be upgraded to 1.5 once the SL5 version is
stable
Cambridge
• 32 new CPU cores added, bringing the total to 1772 HS06.
• 40 TB of disk storage recently added, bringing the total to 140 TB.
• All WNs upgraded to SL5; now investigating glexec.
• Atlas production is making good use of this site.
RALPP
• Largest SouthGrid site.
• The APEL accounting discrepancy now seems to be sorted. There was a very hot GGUS ticket which resulted in a Savannah bug; some corrections were made, and the accounting should now be correct.
• Air conditioning woes caused a lot of emergency downtime. We're expecting some more downtime (due to further a/c, power and BMS issues), but it will be more tightly planned and managed.
• Currently running at <50% CPU due to a/c issues, some suspect disks on WNs, and some WNs awaiting rehoming in another machine room.
• Memory upgrades for 40 WNs, so we will have either 320 job slots @ 4 GB/slot or a smaller number of slots with more memory. These are primarily intended for Atlas simulation work.
• A (fairly) modest amount of extra CPU and disk, purchased at the end of the year, is coming online soon.
Oxford
• A/C failure on 23rd December, the worst possible time:
  – System is more robust now
  – Better monitoring
  – Auto-shutdown scripts, based on IPMI system temperature monitoring (see the sketch after this list)
• Following the SL5 upgrade in Autumn 09 the cluster is running very well.
• A faulty network switch had been causing various timeouts; it has been replaced.
• DPM reorganisation completed quickly once the network was fixed.
• Atlas Squid server installed.
• Preparing a tender to purchase hardware with the 2nd tranche of GridPP3 money.
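A minimal sketch of what such an auto-shutdown script could look like, assuming ipmitool is available; the sensor name, threshold and shutdown policy are illustrative guesses, not the script Oxford actually runs (only the t2wn40-87 node names come from the cluster diagram):

    #!/bin/bash
    # Hypothetical auto-shutdown sketch: poll an IPMI temperature sensor
    # and halt the worker nodes if the machine room overheats.
    THRESHOLD=40                 # trip point in degrees C (assumed)
    SENSOR="Ambient Temp"        # assumed sensor name; see 'ipmitool sdr list'

    temp=$(ipmitool sdr get "$SENSOR" \
           | awk -F': *' '/Sensor Reading/ {print $2}' | awk '{print $1}')

    if [ -n "$temp" ] && [ "${temp%.*}" -ge "$THRESHOLD" ]; then
        logger "Room at ${temp}C (>= ${THRESHOLD}C); shutting down WNs"
        for wn in t2wn{40..87}; do
            ssh "$wn" shutdown -h now &
        done
        wait
    fi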
Grid Cluster setup: CREAM CE & pilot setup (Oxford)

[Diagram: two CREAM CEs, t2ce02 and t2ce06, both gLite 3.2 on SL5; worker nodes t2wn40-87, with t2wn41 glexec-enabled; SCAS server t2scas01.]
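For context, a glexec-enabled WN is typically smoke-tested along these lines (gLite 3.2 paths; the proxy file location is an illustrative assumption):

    # Run as the pilot account, pointing glexec at the payload user's proxy.
    export GLEXEC_CLIENT_CERT=/tmp/payload_proxy     # assumed location
    export GLEXEC_SOURCE_PROXY=/tmp/payload_proxy
    /opt/glite/sbin/glexec /usr/bin/id
    # If identity switching works, 'id' reports the mapped payload account,
    # not the pilot account that invoked glexec.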
Grid Cluster setup: NGS integration setup (Oxford)

[Diagram: the gLite CE ngsce-test.oerc.ox.ac.uk submitting into the NGS cluster ngs.oerc.ox.ac.uk and its worker nodes wn40, wn5x, wn6x, wn7x, wn8x.]
ngsce-test is a virtual machine with the gLite CE software installed.

The gLite WN software is installed via a tarball in an NFS-shared area visible to all the WNs.

The PBSPro logs are rsynced to ngsce-test so that the APEL accounting can match up which PBS jobs were grid jobs; a sketch of that transfer is below.
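As a rough illustration, the log transfer could be a cron job like the following (the schedule and destination path are assumptions; the source path is the PBSPro default accounting-log location):

    # Hypothetical cron entry on the PBSPro server: mirror the accounting
    # logs to ngsce-test so APEL can cross-match grid jobs.
    15 * * * *  rsync -az /var/spool/PBS/server_priv/accounting/ \
                    ngsce-test.oerc.ox.ac.uk:/var/spool/pbs_accounting/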
Contributed 1.2% of Oxford's total work during Q1.
The Operations Dashboard used by ROD is now Nagios-based.
gridppnagios
Oxford runs the UKI Regional Nagios monitoring site. The Operations dashboard will take information from this in due course.
https://gridppnagios.physics.ox.ac.uk/nagios/
https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo
Oxford Tier-2 Cluster – Jan 2009

Located at Begbroke. Tendering for upgrade.

Original cluster (installed April 04) decommissioned January 2009, saving approx 6.6 kW.

17th November 2008 upgrade: 26 servers = 208 job slots, 60 TB disk.

22 servers = 176 job slots, 100 TB disk storage.
Grid Cluster Network setup (Oxford)

[Diagram: three stacked 3Com 5500 switches (backplane stacking cables, 96 Gbps full duplex) connecting the worker nodes and three 20 TB disk pool nodes (t2se0n); each storage node is attached by dual channel-bonded 1 Gbps links.]
10 Gigabit is too expensive, so the new tender will maintain the ~1 Gbit per 10 TB ratio using channel bonding.
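On an SL5-era storage node the dual-link bonding would be configured roughly as below (the interface names, bonding mode and address are illustrative assumptions):

    # /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
    DEVICE=bond0
    IPADDR=10.1.0.21             # assumed address
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=balance-alb miimon=100"   # assumed mode

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none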
Production Mode
Sites have to be more vigilant than ever:
• Closer monitoring
• Faster response to problems
• A proactive attitude: fix problems before GGUS tickets arrive
• Closer interaction with the main experimental users
• Use of the monitoring tools available, shown on the following slides:
Atlas Monitoring
PBSWEBMON
Ganglia
Command Line
• showq | more     (Maui scheduler: running and queued jobs)
• pbsnodes -l      (Torque/PBS: list nodes that are down or offline)
• qstat -an        (PBS: queue status, showing the nodes allocated to each job)
• ont2wns df -hl   (apparently a site-local helper that runs df -hl across the t2 WNs)
Local Campus Network Monitoring
Gridmon
Patch levels – Pakiti v1 vs v2
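Pakiti tracks patch levels by having each node report its installed package list to a central server, which flags missing security updates. A minimal sketch of the idea (the URL and form fields are assumptions, not the real Pakiti protocol):

    # Illustrative Pakiti-style client: post the installed-package list
    # to a central server. URL and fields are hypothetical.
    host=$(hostname -f)
    os=$(cat /etc/redhat-release)
    rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE} %{ARCH}\n' \
      | curl -s -F "host=$host" -F "os=$os" -F "pkgs=@-" \
             https://pakiti.example.org/report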
Monitoring tools etc
Site    Pakiti                   Ganglia   pbswebmon   SCAS, glexec, ARGUS
JET     No                       Yes       No          No
Bham    No                       Yes       No          No, yes, yes
Brist   Yes, v1                  Yes       No          No
Cam     No                       Yes       No          No
Ox      v1 production, v2 test   Yes       Yes         Yes, yes, no
RALPP   v1                       Yes       No          No (but started on SCAS)
Conclusions
• SouthGrid sites' utilisation is improving.
• Many sites have had recent upgrades; others are putting out tenders.
• New hardware will be purchased with the second tranche of GridPP3 money.
• Monitoring for production running is improving.