UKI-SouthGrid Overview and Oxford Status Report Pete Gronbech SouthGrid Technical Coordinator GridPP 24 - RHUL 15 th April 2010


Page 1

UKI-SouthGrid Overview and Oxford Status Report

Pete Gronbech
SouthGrid Technical Coordinator

GridPP 24 - RHUL
15th April 2010

Page 2

SouthGrid September 2009

UK Tier 2 reported CPU – historical view to present

[Chart: monthly accounted CPU (kSI2K hours), Jan-09 to Mar-10, for UK-London-Tier2, UK-NorthGrid, UK-ScotGrid and UK-SouthGrid]

Page 3

SouthGrid Sites – Accounting as reported by APEL

[Chart: monthly accounted CPU (kSI2K hours), Jan-09 to Mar-10, for JET, BHAM, BRIS, CAM, OX and RALPPD]

Sites upgrading to SL5, and recalibration of published SI2K values.

RALPP seems low, even after compensating for the site publishing 1000 instead of 2500.
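The compensation mentioned above amounts to a simple rescaling of the accounted hours by the ratio of the correct to the published SI2K figure. A minimal sketch (the function name and sample figure are illustrative, not from the slides):

```shell
#!/bin/sh
# Rescale accounted SI2K-hours for a site that published 1000 SI2K
# instead of 2500 (factor 2500/1000 taken from the note above).
correct_si2k_hours() {
    published=$1
    # integer shell arithmetic: multiply by 2500/1000
    echo $(( published * 2500 / 1000 ))
}
```

For example, 400000 published kSI2K-hours would be compensated up to 1000000.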

Page 4

Site Resources

Site         HEPSPEC06   CPU (kSI2K, converted     Storage (TB)
                         from HEPSPEC06 benchmarks)
EDFA-JET         1772        442                       1.5
Birmingham       3344        836                     114
Bristol          1836        459                     110
Cambridge        1772        443                     140
Oxford           3332        833                     160
RALPPD          12928       3232                     633
Totals          24984       6246                    1158.5
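The kSI2K column appears to be derived from HEPSPEC06 using the commonly quoted 4 HS06 = 1 kSI2K rule of thumb (e.g. Birmingham: 3344 / 4 = 836; small rounding differences elsewhere in the table). A sketch of that conversion, assuming the factor of 4:

```shell
#!/bin/sh
# Convert a HEPSPEC06 figure to kSI2K using the 4 HS06 = 1 kSI2K
# rule of thumb implied by the table above (integer division).
hs06_to_ksi2k() {
    echo $(( $1 / 4 ))
}
```

For example, RALPPD's 12928 HS06 converts to 3232 kSI2K, matching the table.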

Page 5

JET

• Stable operation (SL5 WNs); quiet period over Christmas
• Could handle more opportunistic LHC work

1772 HS06, 1.5 TB

Page 6

Birmingham

• WNs have been on SL5 since Christmas. We have an ARGUS server (no SCAS). glexec_wn is installed on the local worker nodes (not the shared cluster), so we *should* be able to run multi-user pilot jobs, but this has yet to be tested.

• We also have a CREAM CE in production, but it does not use the ARGUS server yet.

• Some problems on the HPC; GPFS has been unreliable.
• Recent increases in CPU to 3344 HS06 and disk to 114 TB.

Page 7

Bristol

• SL4 WNs and ce01 retired; VMs ce03 and ce04 brought online, giving 132 SL5 job slots

• Metson cluster also being used to evaluate Hadoop
• More than doubled CPU capacity, to 1836 HS06
• Disk at 110 TB
• StoRM will be upgraded to 1.5 once the SL5 version is stable

Page 8

Cambridge

• 32 new CPU cores added, bringing the total to 1772 HS06

• 40 TB of disk storage recently added, bringing the total to 140 TB

• All WNs upgraded to SL5; now investigating glexec
• Atlas production is making good use of this site

Page 9

RALPP

• Largest SouthGrid site
• The APEL accounting discrepancy now seems to be sorted. A very hot GGUS ticket resulted in a Savannah bug and some corrections, and the accounting should now have been corrected.

• Air conditioning woes caused a lot of emergency downtime. We're expecting some more downtimes (due to further a/c, power and BMS issues), but these will be more tightly planned and managed.

• Currently running at <50% CPU due to a/c issues, some suspect disks on WNs, and some WNs awaiting rehoming in the other machine room.

• Memory upgrades for 40 WNs, giving either 320 job slots at 4 GB/slot or a smaller number of slots with higher memory. These are primarily intended for Atlas simulation work.

• A (fairly) modest amount of extra CPU and disk, purchased at the end of the year, is coming online soon.

Page 10

Oxford

• A/C failure on 23rd December, at the worst possible time
– System more robust now
– Better monitoring
– Auto-shutdown scripts (based on IPMI system temperature monitoring)

• Following the SL5 upgrade in Autumn 09, the cluster is running very well.

• A faulty network switch had been causing various timeouts; it has been replaced.

• DPM reorganisation completed quickly once the network was fixed.

• Atlas Squid server installed
• Preparing a tender to purchase hardware with the second tranche of GridPP3 money
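The auto-shutdown scripts mentioned above might look something like the sketch below; the sensor-output format, the 40 °C threshold and the shutdown action are assumptions for illustration, not the actual Oxford scripts.

```shell
#!/bin/sh
# Hedged sketch of an IPMI-temperature auto-shutdown check; threshold
# and sensor parsing are assumptions, not the actual Oxford scripts.

THRESHOLD=40   # degrees C; above this we assume the machine-room a/c has failed

# Pick the highest reading out of `ipmitool sdr type Temperature` style
# output on stdin, e.g. "Ambient Temp | 23 degrees C | ok".
max_temp() {
    grep -o '[0-9][0-9]* degrees C' | awk '{ if ($1+0 > m) m = $1 } END { print m+0 }'
}

# Run from cron on each node, something like:
#   t=$(ipmitool sdr type Temperature | max_temp)
#   [ "$t" -ge "$THRESHOLD" ] && /sbin/shutdown -h now "machine room too hot"
```

Driving the shutdown from the nodes' own IPMI sensors means the protection still works if the central monitoring host is itself a casualty of the cooling failure.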

Page 11

Grid cluster setup: CREAM CE and pilot setup (Oxford)

[Diagram: t2ce02 (CREAM, gLite 3.2, SL5) submits to the glexec-enabled node T2wn41, which authorises via t2scas01; t2ce06 (CREAM, gLite 3.2, SL5) submits to worker nodes T2wn40-87]

Page 12

Grid cluster setup: NGS integration setup (Oxford)

[Diagram: ngsce-test.oerc.ox.ac.uk submitting into the ngs.oerc.ox.ac.uk cluster worker nodes (wn40, wn5x, wn6x, wn7x, wn8x)]

ngsce-test is a virtual machine with the gLite CE software installed.

The gLite WN software is installed via a tarball in an NFS-shared area visible to all the WNs.

PBSpro logs are rsynced to ngsce-test to allow the APEL accounting to match which PBS jobs were grid jobs.

Contributed 1.2% of Oxford's total work during Q1.
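The log-shipping step described above could be driven from cron with something like the sketch below; the directory paths, destination and helper name are assumptions, not the actual Oxford configuration.

```shell
#!/bin/sh
# Hypothetical sketch of shipping PBSpro accounting logs to the CE so
# the APEL parser can match grid jobs against batch records.
# Paths and host are illustrative only.

build_sync_cmd() {
    # $1 = local PBSpro accounting directory, $2 = destination host:path
    # -a preserves timestamps (APEL needs them), -z compresses in transit
    printf 'rsync -az %s %s' "$1" "$2"
}

# From cron on the batch server, something like:
#   eval "$(build_sync_cmd /var/spool/pbs/server_priv/accounting/ \
#           ngsce-test.oerc.ox.ac.uk:/var/spool/pbs/accounting/)"
```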

Page 13

The operations dashboard used by the ROD is now Nagios-based.

Page 14

gridppnagios

Oxford runs the UKI Regional Nagios monitoring site. The Operations dashboard will take information from this in due course.

https://gridppnagios.physics.ox.ac.uk/nagios/

https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo

Page 15

Oxford Tier-2 Cluster – Jan 2009

Located at Begbroke. Tendering for upgrade.

Originally installed April 04; decommissioned January 2009, saving approx. 6.6 kW.

17th November 2008 upgrade:
26 servers = 208 job slots, 60 TB disk
22 servers = 176 job slots, 100 TB disk storage

Page 16

Grid cluster network setup (Oxford)

[Diagram: three stacked 3Com 5500 switches joined by backplane stacking cables (96 Gbps full duplex); T2se0n 20 TB disk pool nodes attached via dual-channel-bonded 1 Gbps links; worker nodes on the same stack]

10 gigabit is too expensive, so the new tender will maintain a ratio of roughly 1 Gbit/s per ~10 TB using channel bonding.
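The rule of thumb above implies a simple sizing calculation for new storage nodes; a sketch, assuming the ~10 TB per bonded 1 Gbit/s link ratio stated on the slide:

```shell
#!/bin/sh
# How many bonded 1 Gbit/s links a storage node needs under the
# ~10 TB per link rule of thumb from the slide (rounding up).
links_for_capacity() {
    tb=$1
    echo $(( (tb + 9) / 10 ))
}
```

For example, a 20 TB T2se0n pool node gets a dual-channel bond, consistent with the network diagram above.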

Page 17

Production Mode

Sites have to be more vigilant than ever:

Closer monitoring

Faster response to problems

A proactive attitude to fixing problems before GGUS tickets arrive

Closer interaction with the main experiment users

Use the monitoring tools available:

Page 18

Atlas Monitoring

Page 19

PBSWEBMON

Page 20

Ganglia

Page 21

Command Line

• showq | more
• pbsnodes -l
• qstat -an
• ont2wns df -hl
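A small wrapper can turn these commands into a quick health summary; the helper below is illustrative (not a script from the slides), and is fed the "node state" lines that `pbsnodes -l` prints.

```shell
#!/bin/sh
# Count worker nodes reported as down or offline, from `pbsnodes -l`
# style output on stdin, e.g. "t2wn41   offline".
# Illustrative helper, not a script from the slides.
count_bad_nodes() {
    awk '$2 == "offline" || $2 == "down" { n++ } END { print n+0 }'
}

# In production, run on the batch server:
#   pbsnodes -l | count_bad_nodes
```

A cron job comparing this count against zero is an easy way to get the "proactive" alerting the previous slide calls for.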

Page 22

Local Campus Network Monitoring

Page 23

Gridmon

Page 24

Patch levels – Pakiti v1 vs v2

Page 25

Monitoring tools etc

Site    Pakiti                   Ganglia   Pbswebmon   SCAS, glexec, ARGUS
JET     No                       Yes       No          No
Bham    No                       Yes       No          No, yes, yes
Brist   Yes (v1)                 Yes       No          No
Cam     No                       Yes       No          No
Ox      v1 production, v2 test   Yes       Yes         Yes, yes, no
RALPP   v1                       Yes       No          No (but started on SCAS)

Page 26

Conclusions

• SouthGrid sites' utilisation is improving
• Many have had recent upgrades; others are putting out tenders
• New hardware will be purchased with the GridPP3 second tranche
• Monitoring for production running is improving