UKI-SouthGrid Overview
GridPP30
Pete Gronbech
SouthGrid Technical Coordinator and GridPP Project Manager
Glasgow - March 2013
[Chart: UK Tier 2 reported CPU per month, Oct-11 to Feb-13, in K SPEC int 2000 hours, for UK-London-Tier2, UK-NorthGrid, UK-ScotGrid and UK-SouthGrid]
UK Tier 2 reported CPU – Historical View to present
• Stats were last reported at CERN in September 2011, so this shows the data since then.
SouthGrid Sites
Accounting as reported by APEL
[Chart: SouthGrid site accounting per month, Oct-11 to Feb-13, in K SPEC int 2000 hours, for JET, BHAM, BRIS, CAM, OX, RALPPD and Sussex]
Arrival of Sussex as an ATLAS-enabled site
VO Usage
Usage is dominated by the LHC VOs; only 6% comes from non-LHC VOs.
Non-LHC VOs
A wide range of ‘Other VOs’
GridPP4 h/w generated MoU for 2012-14, from Steve Lloyd

Storage:
Site     2012 TB  2013 TB  2014 TB
bham         265      282      346
bris          96       96      116
cam          123      139      167
ox           409      456      549
Jet           19       21       26
RALPPD       830      862     1034
Total       1744     1857     2237

CPU:
Site     2012 HS06  2013 HS06  2014 HS06
bham          3375       3305       4184
bris          1651       1619       1982
cam            730        770        949
ox            3287       3372       4173
Jet            166        169        208
RALPPD       10082       9959      12243
Q412 Resources
Total available to GridPP
Site HEPSPEC06 Storage (TB)
EFDA JET 1772 10.5
Birmingham 6288 255
Bristol 2247 117
Cambridge 2445 287
Oxford 11520 720
RALPP 26410 1260
Sussex 1381 60
Totals 53182 2709.5
Question of MoUs
• New experimental requirements in Sept 2012 generated new increased shares.
MoU generated by Dave Britton, as shown at GridPP29
Site     2013 TB  2013 HS06
bham         269       2990
bris          68       1271
cam          214        528
ox           567       3685
RALPPD       890      13653
Total       2008      22127
Q412 Resources
Total available to GridPP
Site HEPSPEC06 Storage (TB)
EFDA JET 1772 10.5
Birmingham 6288 255
Bristol 2247 117
Cambridge 2445 287
Oxford 11520 720
RALPP 26410 1260
Sussex 1381 60
Totals 53182 2709.5
Up 151TB
Up 2933 HS06
JET
• The site has been underused for the last 6 months.
• This is a non-Particle-Physics site, so all LHC work is a bonus.
• Essentially a pure CPU site
  – 1772 HEPSPEC06
  – 10.5 TB of storage
• All service nodes have been upgraded to EMI2.
• CVMFS has been set up and configured for LHCb and ATLAS (a quick check is sketched below).
• Could be utilised much more!!
• Active non-LHC VOs: Biomed, esr, fusion and Pheno.
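A minimal sanity check of the CVMFS setup on a worker node might look like the following; the standard CERN repository names are used, and which repositories JET actually mounts is an assumption:

# Probe the experiment repositories (names assumed to be the standard cern.ch ones)
cvmfs_config probe atlas.cern.ch lhcb.cern.ch

# Show cache and mount status for one repository
cvmfs_config stat atlas.cern.ch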
Birmingham Tier 2 Site
• Most active non-LHC VOs: Biomed and ILC.
• Mark Slater has helped a local Neuroscience group set up a VO; grid work to follow. (Some involvement with CERN@school.)
• Mark Slater is the LHCb UK Operations rep/shifter.
• Major VOs are ATLAS, ALICE and LHCb.
• Middleware:
  – DPM, WN: EMI 2; everything else: UMD 1
• Complete overhaul of aircon in the last 12 months. One unit (14kW) left to be installed in the next couple of weeks (hopefully!)
• CVMFS fully installed.
• Now providing 110TB space for ALICE in their own xrootd area. No other xrootd/webDAV updates
Bristol
Status
• StoRM SE upgraded to SL6 and StoRM 1.10. Problematic at first, but the StoRM developers helped with modifying the default config, and helped debug why it was publishing 0 used space (apparently a known bug).
• Upgrades to EMI middleware have improved CMS site readiness.
• CVMFS set up for CMS, ATLAS & LHCb.
• Ongoing development of a Hadoop SE: gridFTP + SRM server ready, set-up of PhEDEx (Debug) in progress.
• Active non-LHC VOs: ILC.
• Landslides VO work currently on hold.
• Working with CMS to plan the best way forward for Bristol.
Cambridge
• Status
  – CPU: 140 job slots, 1657 HS06
  – Storage: 277 TB
  – Most active non-LHC VO: Camont, but almost exclusively an ATLAS/LHCb site.
• The Camtology/Imense work at Cambridge has essentially finished (we still host some of their kit)
• We have an involvement in a couple of Computational Radiotherapy projects (VoxTox and AccelRT) where there may possibly be some interest in using the Grid.
RALPP
• SouthGrid's biggest site.
• Major VOs: CMS, ATLAS and LHCb.
• Non-LHC VOs: Biomed, ILC and esr.
• Planned migration to a new computer room this year, with six water cooled racks. Will try to minimise downtime. Other racks will move to the Atlas building.
• SE is dCache 1.9.12 – planning to upgrade to 2.2 in the near future.
• 20Gbit link between the two computer rooms.
• Rob Harper is on the security team.
• New member of staff (Ian Loader) starting very soon.
Oxford
• Oxford’s workload is dominated by ATLAS analysis and production
• Most active non-LHC VOs: esr, fusion, hone, pheno, t2k and zeus.
• Recent Upgrades
  – We have a 10Gbit link to the JANET router, but it is currently rate-capped at 5Gbps.
  – Oxford will get a second 10Gbit line enabled soon, and then the rate cap can be lifted.
• SouthGrid Support
  – Providing support for Bristol, Sussex and JET
  – The Landslides VO supported at Oxford and Bristol
  – Helped bring Sussex onto the Grid as an ATLAS site
• Oxford Particle Physics Masterclasses with Grid Computing talk.
Other Oxford Work
• CMS Tier 3
  – Supported by RALPPD's PhEDEx server. Now configured to use CVMFS and xrootd as the local file access protocol.
  – Useful for CMS, and for us, keeping the site busy in quiet times.
  – However, it can block ATLAS jobs, so during the accounting period a max running jobs limit was applied.
  – Largest non-CMS Tier-2 site in the UK.
• ALICE Support
  – The ALICE computational requirements are shared between Birmingham and Oxford.
• UK Regional Monitoring
  – Kashif runs the Nagios-based WLCG monitoring on the servers at Oxford.
  – These include the Nagios server itself and support nodes for it: SE, MyProxy and WMS/LB.
  – KM also remotely manages the failover instance at Lancaster.
  – There are very regular software updates for the WLCG Nagios monitoring.
• VOMS server replication at Oxford (and IC)
• Early Adopters
  – In the past we were official early adopters for testing of CREAM, ARGUS and torque_utils. Recently we have tested early in a less official way.
Multi VO Nagios Monitoring
• Monitoring three VOs
  – t2k.org
  – snoplus.snolab.ca
  – vo.southgrid.ac.uk
• Customized to suit small VOs
  – Only using direct job submission
  – Test jobs submitted every 8 hours
  – Test jobs can stay in the queue for 6 hours before being cancelled by Nagios
• Using VO-feed for topology information
  – VO-feed is also hosted at Oxford
  – Fine-grained control of what to monitor
  – But it requires manual changes
• Probably the first production Multi-VO Nagios instance in EGI
  – Found many bugs
  – It worked very well with two VOs, but after adding the third VO it showed some strange issues
  – Opened a GGUS ticket and working on it
Multi VO Nagios Monitoring
GridPP Cloud work at Oxford
• Working to provide an OpenStack based cloud infrastructure at Oxford
• OpenStack Folsom release installed with the help of Cobbler and Puppet
  – Using RHEL 6.4
  – Most of the installation and configuration is automated
• Started with three machines
  – One controller node running the OpenStack core services
  – One compute node running Nova-compute and Nova-network
  – One storage node to provide block storage and an NFS mount for Glance images
• Plan to add more compute nodes in future
• We are open to providing our infrastructure for testing (a minimal test-launch sketch follows below)
  – https://www.gridpp.ac.uk/wiki/GridPP_Cloud_sites
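For illustration, launching a test VM on a Folsom-era OpenStack installation like this might look roughly as follows; the credentials file, image and flavour names are hypothetical, not the actual Oxford configuration:

# Load credentials for a test tenant (file name is an assumption)
source keystonerc_test

# Register a test image with Glance (image file and name are assumptions)
glance image-create --name "sl6-test" --disk-format qcow2 --container-format bare --file sl6.qcow2

# Boot a small VM via Nova (flavour and image names are assumptions)
nova boot --flavor m1.small --image sl6-test test-vm-01

# Check the instance state
nova list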
XrootD and WebDAV
• XrootD and WebDAV on DPM
• Completely separate services, but with similar minimum requirements, so if you can do XrootD, you can do WebDAV.
• Local XrootD is pretty well organised; federations currently require name lookup libraries which lack a good distribution mechanism.
• Configuration is mostly boilerplate – copy ours:
# Federated xrootd
DPM_XROOTD_FEDREDIRS="atlas-xrd-uk.cern.ch:1094:1098,atlas,/atlas xrootd.ba.infn.it:1094:1213,cms,/store"

# Atlas federated xrootd
DPM_XROOTD_FED_ATLAS_NAMELIBPFX="/dpm/physics.ox.ac.uk/home/atlas"
DPM_XROOTD_FED_ATLAS_NAMELIB="XrdOucName2NameLFC.so root=/dpm/physics.ox.ac.uk/home/atlas match=t2se01.physics.ox.ac.uk"
DPM_XROOTD_FED_ATLAS_SETENV="LFC_HOST=prod-lfc-atlas-ro.cern.ch LFC_CONRETRY=0 GLOBUS_THREAD_MODEL=pthread CSEC_MECH=ID"

# CMS federated xrootd
DPM_XROOTD_FED_CMS_NAMELIBPFX="/dpm/physics.ox.ac.uk/home/cms"
DPM_XROOTD_FED_CMS_NAMELIB="libXrdCmsTfc.so file:/etc/xrootd/storage.xml?protocol=xroot"

# General local xrootd
DPM_XROOTD_SHAREDKEY="bIgl0ngstr1ng0FstuFFi5l0ng"
DPM_XROOTD_DISK_MISC="xrootd.monitor all rbuff 32k auth flush 30s window 5s dest files info user io redir atl-prod05.slac.stanford.edu:9930
if exec xrootd
xrd.report atl-prod05.slac.stanford.edu:9931 every 60s all -buff -poll sync
fi"
DPM_XROOTD_REDIR_MISC="$DPM_XROOTD_DISK_MISC"
DPM_XROOTD_FED_ATLAS_MISC="$DPM_XROOTD_DISK_MISC"
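With this in place, a read can be exercised with xrdcp, first against the local door and then via the UK ATLAS federation redirector named above; the file paths below are made-up examples and a valid ATLAS grid proxy is assumed:

# Direct read from the local DPM xrootd door (example path only)
xrdcp root://t2se01.physics.ox.ac.uk//dpm/physics.ox.ac.uk/home/atlas/atlasdatadisk/test/file.root /tmp/file.root

# The same request via the federation redirector, which should redirect
# the client to a site holding a replica (global path is an example only)
xrdcp root://atlas-xrd-uk.cern.ch//atlas/test/file.root /tmp/file-fed.root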
XrootD
• The main source of documentation is here: https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Xroot/Setup
• We've also switched to using xrootd for ATLAS and CMS local file access (which is a VO-side change, not a site one), but this isn't making use of the federation yet.
• ATLAS are currently using XrootD-based file stager copies, not XrootD direct IO. We do hope to try that too.
• All the xrootd traffic currently gets reported to the ATLAS FAX monitoring, which made the graphs look a bit odd when we turned it on:
WebDAV
• WebDAV is simpler:

# DPM webdav
DPM_DAV="yes"                  # Enable DAV access
DPM_DAV_NS_FLAGS="Write"       # Allow write access on the NS node
DPM_DAV_DISK_FLAGS="Write"     # Allow write access on the disk nodes
DPM_DAV_SECURE_REDIRECT="On"   # Enable redirection from head to disk using plain HTTP
• Commodity clients work, but nothing supports everything you'd want (see the curl sketch below).
• More here: https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/WebDAV/ClientTutorial
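As one example of a commodity client, curl with a grid proxy can list and fetch files over WebDAV; the host is the Oxford SE named in the xrootd config above, but the paths (and the assumption that the default HTTPS port is used) are illustrative only:

# List a directory over WebDAV, authenticating with a grid proxy
curl --capath /etc/grid-security/certificates --cert $X509_USER_PROXY --key $X509_USER_PROXY -L "https://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/atlas/"

# Download a file the same way (path is an example only)
curl --capath /etc/grid-security/certificates --cert $X509_USER_PROXY --key $X509_USER_PROXY -L -o file.root "https://t2se01.physics.ox.ac.uk/dpm/physics.ox.ac.uk/home/atlas/test/file.root"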
Sussex
• Sussex has a significant local ATLAS group; their system is designed for the high IO bandwidth patterns that ATLAS analysis can generate.
• EMI2 middleware installed
• CVMFS installed and configured
• Set up as an ATLAS production site, running jobs in anger since February 2013
• New 64-core node arriving in the next week, and a further 128 cores and 120TB to be added in the summer
• SNO+ is expected to start using the site shortly.
• JANET link scheduled to be upgraded from the current 2Gb (plus 1Gb resilient failover) to 10Gb in Autumn 2013.
Conclusions
• SouthGrid's seven sites are well utilised, but some sites are small compared with others.
• Birmingham: supporting ATLAS, ALICE and LHCb.
• Bristol: has upgraded to the latest version of StoRM and to EMI middleware, and has been available to CMS since mid December. Hope to be better utilised by CMS now. Local funding will be used to enhance the CPU and storage.
• Cambridge: size was reduced when the Condor part of the cluster was decommissioned; local funding to be used to boost capacity.
• JET: continues to be available as a CPU site, but with very little storage. However, CVMFS is in place and all middleware is at EMI2. Should be a useful MC site for LHCb.
• Oxford: wide involvement in many VOs and in areas of development and GridPP infrastructure.
• RALPPD: remains SouthGrid's largest site and a major CMS contributor.
• Sussex: successfully running as an ATLAS site, with some local upgrades planned for this year.