CLRC-RAL Site Report, John Gordon, CLRC eScience Centre, HEPiX FNAL, 23 Oct 2002

Page 1:

23 Oct 2002 HEPiX FNAL John Gordon

CLRC-RAL Site Report

John Gordon

CLRC eScience Centre

Page 2:

• General PP Facilities

• New UK Supercomputer

• BaBar TierA Centre

• Networking

Page 3:

Computing Farm

This year's new hardware consists of 4 racks holding 156 dual-CPU PCs, a total of 312 1.4GHz Pentium III Tualatin CPUs. Each box has 1GB of memory, a 40GB internal disk and 100Mb Ethernet.

Inside the Tape Robot: the tape robot was upgraded last year and now uses 60GB STK 9940 tapes. It currently holds 45TB but could hold 330TB when full.

UK GridPP Tier1/A Centre at CLRC

40TB disk farm

The new mass storage unit can store 40TB of raw data after the RAID 5 overhead (the capacity arithmetic is sketched at the end of this slide). The PCs are clustered on network switches with up to 8x1000Mbit Ethernet out of each rack.

Prototype Tier 1 centre for CERN LHC and FNAL experiments

Tier A centre for SLAC BaBar experiment

Testbed for EU DataGrid project
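For context, a minimal sketch of the capacity arithmetic on this slide: the 8-disk, 120GB array layout is purely a hypothetical illustration of the RAID 5 overhead (the slides only quote the 40TB usable total), while the tape figures (60GB cartridges, 45TB used, 330TB when full) come straight from the slide.

```python
# Capacity arithmetic sketch; per-array disk count and drive size are
# hypothetical, tape-robot figures are the ones quoted on the slide.

def raid5_usable_tb(n_disks: int, disk_tb: float) -> float:
    """RAID 5 keeps one disk's worth of parity per array,
    so usable capacity is (n_disks - 1) disks."""
    return (n_disks - 1) * disk_tb

# Hypothetical array layout (not stated on the slide): 8 x 120GB drives
# per array, purely to illustrate the RAID 5 overhead fraction.
per_array_tb = raid5_usable_tb(8, 0.12)        # 0.84 TB usable per array
overhead = 1 - per_array_tb / (8 * 0.12)       # 0.125

# Tape robot figures quoted on the slide: 60GB STK 9940 cartridges,
# 45TB currently held, 330TB when full.
cartridges_when_full = 330_000 / 60            # ~5500 cartridges
fill_fraction = 45 / 330                       # ~14% of capacity used

print(f"RAID 5 overhead for an 8-disk array: {overhead:.1%}")
print(f"Cartridges needed at full robot capacity: {cartridges_when_full:.0f}")
print(f"Robot currently about {fill_fraction:.0%} full")
```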

Page 4:

CSF Linux Accounts: Financial Year 2001/02

antaresw 17%, h1 24%, theory 15%, sno 15%, bfactory 3%, zeus 11%, cms 5%, lhcb 2%, delphiuk 8%

Page 5:

CSF Linux Accounts: Since April 2002

delphiuk 4%, lhcb 1%, atlas 13%, antaresw 5%, theory 14%, cms 2%, h1 8%, bfactory 46%, sno 3%, zeus 4%

Page 6:

DataStore User Data totals: 36TB of HEP data (the accompanying capacity chart also showed free space and tapes free but out of the robot).

H1 28%, CMS 10%, ALEPH 9%, CSFSERV 7%, PDKDATA 6%, DELPHI 6%, CBAR 5%, ATLAS 5%, LHCB 5%, SNO 3%, PDKDST 3%, BABAR 2%, BABARDB 2%, MINOS 2%, CSFRUTH 1%, BJS 1%, OPAL 1%, FUNNEL 1%; the remaining accounts (DE2, JTB, BABAR421, CDFBACK, ANTARESW, STASSINA, PEARCE, ICFAMON, DPS62, DMN, GPPTB, AXPRL1, PJL, WA94, PPUX, SZYMANSK, PDKISIS, DARK, COLLINSM, TINKERBE, CSFDEV, VRVS, RAS, SGEORGE, ADYE, LARTEY, BCE, BFACTORY, JHCR, SYSLOG1, JCH, DCO, ZOU, MDW2, JHCR94) each hold under 1%.

Page 7:

HPCx

• UK supercomputer for the next 6 years
  – Collaboration of CLRC Daresbury Laboratory, Edinburgh EPCC and IBM
• Sited at CLRC-DL
• http://www.hpcx.ac.uk
• Will double in performance every 2 years, i.e. 2 upgrades
• Capability computing
  – Target is to get 50% of the jobs using 50% of the machine
• Hardware
  – 40x32 IBM pSeries 690 Regatta-H nodes (Power4 CPUs)
  – 1280 1.3GHz CPUs; estimated peak performance 6.6 TeraFLOPS (see the sketch below)
  – IBM Colony switch connects blocks of 8 CPUs (i.e. it looks like 160x8, not 40x32)
  – 1280GB of memory
  – 2x32 already in place as a migration aid
  – Service testing mid November, service December
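As a quick sanity check of the peak figure quoted above, the sketch below multiplies out CPUs, clock and floating-point operations per cycle; the 4 flops/cycle (two fused multiply-add units per Power4 core) is an assumption, not a number stated on the slide.

```python
# Peak-performance sanity check for the HPCx hardware figures above.
cpus = 1280
clock_ghz = 1.3
flops_per_cycle = 4   # assumed: two fused multiply-add units per Power4 core

peak_tflops = cpus * clock_ghz * flops_per_cycle / 1000
print(f"Estimated peak: {peak_tflops:.2f} TFLOPS")   # ~6.66, matching the ~6.6 quoted
```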

Page 8:

HPCx

• Software
  – Capability computing on around 1000 high-performance CPUs
  – Terascale Applications team
    • Parallelising applications for 1000s of CPUs
  – Different architecture compared to the T3E etc.
  – HPCx is a cluster of 32-processor machines, compared to the MPP style of the T3E
  – Some MPI operations are now very slow (e.g. barriers, all-to-all communication; see the sketch below)
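To make the point about collectives concrete, here is a minimal sketch (an illustration, not code from HPCx or the talk) of the two MPI operations named above, written with the mpi4py bindings. On a cluster of 32-way SMP nodes these synchronising, all-pairs operations cross the interconnect far more than point-to-point messages do.

```python
# Minimal sketch of the MPI collectives called out on this slide
# (barrier and all-to-all), using mpi4py purely as an illustration.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Barrier: no rank proceeds until every rank has arrived.
comm.Barrier()

# All-to-all: every rank sends one item to every other rank, so traffic
# grows with the square of the number of ranks.
send = [rank * size + dest for dest in range(size)]
recv = comm.alltoall(send)

if rank == 0:
    print(f"{size} ranks completed a barrier and an all-to-all exchange")
```

Run under an MPI launcher, e.g. `mpiexec -n 32 python alltoall_demo.py` (hypothetical file name); the cost of such collectives grows sharply once the ranks span many separate nodes.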

Page 9:

RAL Tier A

• RAL is the Tier A Centre for BaBar
  – Like CC-IN2P3 but concentrating on different data
  – Shared resource with LHC and other experiments
  – Use

Page 10:

Hardware

• 104 “noma”-like machines allocated to BaBar
  – 156 + old farm shared with other experiments
  – 6 BaBar Suns (4-6 CPUs each)
• 20 TB disk for BaBar
  – Also using ~10 TB of pool disk for data transfers
  – All disk servers on Gigabit Ethernet
  – Pretty good server performance
• … as well as existing RAL facilities
  – 622 Mbits/s network to SLAC and elsewhere
  – AFS cell
  – 100TB tape robot
  – Many years' experience running BaBar software

Page 11:

Problems

• Disk problems tracked down to a bad batch of drives
  – All drives are now being replaced by the manufacturer
    • Our disks should be done in ~1 month
  – By using spare servers, replacement shouldn't interrupt service
• Initially suffered from lack of support staff and out-of-hours support (for US hours)
  – Two new system managers now in post
  – Two more being recruited (one just for BaBar)
  – Additional staff have been able to help with problems at weekends
  – Discussing more formal arrangements

Page 12:

Page 13:

RAL Batch CPU Use

[Chart: CPU Hours per Week (Normalised to P450) by Week Beginning, 0-100,000 hours, split into UK Users and Non-UK Users]

Full usage at full efficiency of BaBar CPUs = 106,624 Hours/Week; 59,733 according to MOU
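As a rough, hedged reconstruction (not from the talk): if the 104 dual-CPU machines allocated to BaBar are the basis of the figure, then 208 CPUs running 168 hours/week imply a P450 normalisation factor of roughly 3.05 per 1.4GHz CPU. Both the CPU count used and that factor are inferred here by working backwards from the quoted total.

```python
# Rough reconstruction of the "full usage" figure above; the P450
# normalisation factor is inferred from the quoted total, not a number
# given in the talk.
babar_cpus = 104 * 2              # 104 dual-CPU machines allocated to BaBar
hours_per_week = 24 * 7           # 168
implied_factor = 106_624 / (babar_cpus * hours_per_week)

print(f"Implied P450 normalisation factor: {implied_factor:.2f}")     # ~3.05
print(f"Full usage: {babar_cpus * hours_per_week * implied_factor:,.0f} hours/week")
print(f"MOU commitment: 59,733 hours/week ({59_733 / 106_624:.0%} of full capacity)")
```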

Page 14:

RAL Batch Users (running at least one non-trivial job each week)

[Chart: Users per Week by Week Beginning, 0-30 users, split into UK Users and Non-UK Users]

A total of 113 new BaBar users registered since December

Page 15:

Data at RAL

• All data in Kanga format is at RAL
  – 19 TB currently on disk
    • Series-8 + series-10 + reskimmed series-10
    • AllEvents + streams
    • data + signal + generic MC
• New data copied from SLAC within 1-2 days (see the link-capacity sketch below)
• RAL is now the primary Kanga analysis site
  – New data is archived to tape at SLAC and then deleted from disk
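For scale, a minimal sketch of the raw capacity of the RAL-SLAC path quoted on the Hardware slide (622 Mbit/s); protocol overhead and competing traffic are ignored, so these are upper bounds on what the link could move per day.

```python
# Upper-bound throughput of the 622 Mbit/s RAL-SLAC path quoted earlier;
# real transfers share the link and carry protocol overhead.
link_mbit_s = 622
seconds_per_day = 24 * 3600

tb_per_day_at_full_rate = link_mbit_s / 8 * seconds_per_day / 1e6   # MB/s -> TB/day
for utilisation in (1.0, 0.5, 0.1):
    print(f"At {utilisation:.0%} sustained utilisation: "
          f"{tb_per_day_at_full_rate * utilisation:.2f} TB/day")
```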

Page 16:

Changes since July

• Two new RedHat 6 front-end machines
  – Dedicated to BaBar use
  – Login to babar.gridpp.rl.ac.uk
• Trial RedHat 7.2 service
  – One front-end and (currently) 5 batch workers
  – Once we are happy with the configuration, many/all of the rest of the batch workers will be rapidly upgraded
• ssh AFS token passing installed on front-ends
  – So your local (e.g. SLAC) token is available when you log in
• Trial Grid Gatekeeper available (EDG 1.2)
  – Allows job submission from the Grid
• Improved new user registration procedures

Page 17:

Plans

• Upgrade the full farm to RedHat 7.2
  – Leave a RedHat 6 front-end for use with older releases
• Upgrade Suns to Solaris 8 and integrate into PBS queues
• Install dedicated data import/export machines
  – Fast (Gigabit) network connection
  – Special firewall rules to allow scp, bbftp, bbcp, etc.
• AFS authentication improvements
  – PBS token passing and renewal
  – Integrated login (AFS token on login, like SLAC)

Page 18:

Plans

• Objectivity support
  – Works now for private federations, but no data import
• Support Grid “generic accounts”, so special RAL user registration is no longer necessary
• Procure next batch of hardware
  – Delivery probably early 2003

Page 19:

Network

• Tier1 internal networking will be a hybrid of
  – 100Mb to nodes of CPU farms, with 1Gb up from switches (see the uplink sketch after this slide)
  – 1Gb to disk servers
  – 1Gb to tape servers
• UK academic network SuperJANET4
  – 2.5Gbit backbone, upgrading to 10Gb in 2002
• RAL's 622Mb link into SJ4 was upgraded to 2.5Gb in June 02
• SJ4 has a 2.5Gb interconnect to Geant
• 2.5Gb links to ESnet and Abilene just for research users
• UK involved in networking development
  – internal with Cisco on QoS
  – external with DataTAG
    • Lambda CERN -> Starlight
    • Private connections
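To put the hybrid design above in context, a minimal sketch of the per-rack uplink arithmetic using only figures quoted earlier (156 dual-CPU PCs in 4 racks, 100Mb per node, up to 8x1Gb uplinks per rack); the even split of PCs across racks is an assumption.

```python
# Rack uplink arithmetic for the farm described earlier; the even split
# of the 156 PCs across the 4 racks is an assumption.
pcs_total = 156
racks = 4
node_mbit = 100
uplinks_per_rack = 8
uplink_mbit = 1000

pcs_per_rack = pcs_total / racks                      # 39 nodes per rack
demand_gbit = pcs_per_rack * node_mbit / 1000         # 3.9 Gb/s if every node runs flat out
uplink_gbit = uplinks_per_rack * uplink_mbit / 1000   # 8 Gb/s out of each rack

print(f"Per-rack node demand: {demand_gbit:.1f} Gb/s")
print(f"Per-rack uplink capacity: {uplink_gbit:.0f} Gb/s")
print(f"Demand / uplink ratio: {demand_gbit / uplink_gbit:.2f}")   # < 1, so no oversubscription
```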