LHCC RRB SG 16 Sep. 2004 P. Vande Vyvre CERN-PH 1 On-line Computing M&O LHCC RRB SG 16 Sep 2004 P. Vande Vyvre CERN/PH for 4 LHC DAQ project leaders

On-line Computing M&O

LHCC RRB SG 16 Sep 2004

P. Vande Vyvre CERN/PH
for 4 LHC DAQ project leaders

Introduction

• Questions raised by RRB Scrutiny Group:
  – System managers profiles
  – Number of system managers
  – M&O budget category
  – Replacement profile of computer/network equipment

• Common answer from 4 LHC experiments

• See also presentation by A. Ceccucci to RRB SG in April 2003 on M&O for Online Computing

System managers profiles

• Level-1
  – Function: react to alarms; follow predefined procedures; 24/7 operational
  – Qualification: experience and knowledge of computers
• Level-2
  – Function: install/update systems/services; configure/monitor; piquet service
  – Qualification: same as above + 2 years experience, Unix shell scripts etc.
• Supervisor
  – Function: overall supervision and direction of these tasks
  – Qualification: informatics professional

• Continuity is needed for Level-2 and supervisor personnel

System management effort (1)

• Estimates based on LCG guidelines: fixed number of boxes (PC, network switch, storage element) per system manager
• Differences between online and offline systems:
  – Wide variety of equipment used as a single system
  – Various PCs with different configurations (trigger farms, dataflow, control, monitoring, file servers)
  – Variety compounded by staged procurements
  – Very large and highly loaded network (e.g. event building)
  – Failure of any part of the online system will either partially reduce data-taking efficiency (e.g. loss of an HLT sub-farm) or interrupt data taking (failure of the central controller), i.e. we have to run a complete coherent system
• Dedicated team with appropriate skills needed to ensure reliability and optimal capacity of the online systems

System management effort (2)

• Manpower from collaboration?
  – LHC collaborations are very large, but attempts to find suitably qualified effort for system management have failed to meet even today’s needs
  – Most people (physicists, engineers) do not have the right profile
  – Institutes that have people with the proper qualifications are not prepared to locate them at CERN for adequate periods
• Full operation
  – 24/7 cover at Level-1, normal working hours at Level-2 + service piquet
  – At least 5 people Level-1 and 5 people Level-2. Reduced by some overlap
  – Shift crew will contribute to Level-1
• Provisional estimates to be adapted (2008-9) following experience of running the system and a better knowledge of the system reliability

System management effort (3)

Experiment   2005          2006          2007       2008       2009
ALICE        2 (1+1)       3 (2+1)       4 (3+1)    5 (4+1)    5 (4+1)
ATLAS        2 (1+1)       3 (2+1)       5 (4+1)    8 (6+2)    9 (7+2)
CMS          1.5 (1+0.5)   3 (2+1)       7 (5+2)    8 (6+2)    10 (8+2)
LHCb         2 (1.5+0.5)   3 (2.5+0.5)   5 (4+1)    6 (5+1)    6 (5+1)

Total effort in FTEs (Level-1 and Level-2 + Supervisor)
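
The slide gives per-experiment figures only. As a rough cross-check, the sketch below sums them into a combined total per year. It assumes each "(a+b)" entry reads as Level-1/Level-2 effort plus supervisor effort, as the caption suggests. The names `YEARS`, `EFFORT` and `totals_per_year` are illustrative, and the combined totals are derived here rather than quoted from the presentation.

```python
# FTE figures transcribed from the slide as (Level-1/Level-2, supervisor) pairs.
# Assumption: "(a+b)" means a FTE of Level-1/Level-2 plus b FTE of supervisor.
YEARS = [2005, 2006, 2007, 2008, 2009]
EFFORT = {
    "ALICE": [(1, 1), (2, 1), (3, 1), (4, 1), (4, 1)],
    "ATLAS": [(1, 1), (2, 1), (4, 1), (6, 2), (7, 2)],
    "CMS": [(1, 0.5), (2, 1), (5, 2), (6, 2), (8, 2)],
    "LHCb": [(1.5, 0.5), (2.5, 0.5), (4, 1), (5, 1), (5, 1)],
}


def totals_per_year(effort):
    """Sum Level-1/Level-2 and supervisor FTEs over all experiments, per year."""
    return [
        sum(pairs[i][0] + pairs[i][1] for pairs in effort.values())
        for i in range(len(YEARS))
    ]


if __name__ == "__main__":
    for year, total in zip(YEARS, totals_per_year(EFFORT)):
        print(year, total)
```

Keeping the two components separate makes it easy to check each entry against the total printed in front of its parentheses.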

M&O budget category

• M&O A

• Request of CERN management and RRB

• No other identified source

Replacement of equipment (1)

• Equipment: PCs, network, and storage used for dataflow and online trigger
• Motivations:
  – Reliability of equipment as it ages
  – Maintainability after a few years (3-year warranty)
  – Suitability of old equipment to follow evolution of operating system and to work with new equipment
  – Need to follow Operating System (OS) evolution:
    • Security patches
    • New PCs (staged installation) not supported by old OS versions
    • Old OS versions not supported
    • Code will continue to be developed with dependencies on the OS and compiler versions
    • Online trigger code based on/using offline code developed for current OS version

Replacement of equipment (2)

• Categories
  – Disk and fileservers: lower reliability and very rapid evolution. 3 years
  – PCs: 4 years
    • Replacement cost will not directly follow Moore’s Law: I/O performance limitations, new multi-core architecture might require major increase in system memory
  – Network
    • Central switch: 5 years (= period of maintenance by manufacturer)
    • Smaller peripheral switches: 4 years (shorter warranty but less critical)
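
To show what these replacement periods imply, the sketch below converts each period into the share of installed equipment replaced per year. It assumes a steady state in which purchases are spread evenly over the lifetime, which staged procurement only approximates. The names `LIFETIME_YEARS` and `annual_replacement_fraction` are illustrative, and this is not the budgeting method used by the experiments.

```python
# Replacement periods in years, as listed on the slide.
LIFETIME_YEARS = {
    "disk and fileservers": 3,
    "PCs": 4,
    "central switch": 5,
    "peripheral switches": 4,
}


def annual_replacement_fraction(lifetime_years):
    """Share of an installed base replaced each year.

    Assumption: purchases are spread evenly over the equipment lifetime.
    """
    if lifetime_years <= 0:
        raise ValueError("lifetime must be positive")
    return 1.0 / lifetime_years


if __name__ == "__main__":
    for category, years in LIFETIME_YEARS.items():
        share = annual_replacement_fraction(years)
        print(f"{category}: {share:.0%} of the installed base per year")
```

As the slide notes for PCs, the cost of each replacement will not simply track Moore’s Law, so these shares describe unit turnover only and not budget.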

Previous practice

• LEP and fixed target era:
  – Computers were complete systems qualified by a commercial company
  – Maintenance contract to be paid by experiments
  – System managers in experiments (some CERN staff)
  – CERN had operator staff in the computing center and in groups giving support to experiments

• LHC era:
  – Components tested, qualified and assembled into complete systems by the experiments
  – Overall system much larger and more complex than previously
  – Very few operators at CERN directly employed by CERN