LHCC RRB SG 16 Sep. 2004 P. Vande Vyvre CERN-PH1
On-line Computing M&O
LHCC RRB SG 16 Sep 2004
P. Vande Vyvre, CERN/PH, for the 4 LHC DAQ project leaders
Introduction
• Questions raised by RRB Scrutiny Group:
  – System managers profiles
  – Number of system managers
  – M&O budget category
  – Replacement profile of computer/network equipment
• Common answer from 4 LHC experiments
• See also presentation by A. Ceccucci to RRB SG in April 2003 on M&O for Online Computing
System managers profiles
Category | Function | Qualification
Level-1 | React to alarms; follow predefined procedures; 24/7 operational | Experience and knowledge of computers
Level-2 | Install/update systems and services; configure/monitor; piquet service | Same as above + 2 years' experience, Unix shell scripts, etc.
Supervisor | Overall supervision and direction of these tasks | Informatics professional
• Continuity is needed for Level-2 and supervisor personnel
System management effort (1)
• Estimates based on LCG guidelines: fixed number of boxes (PC, network switch, storage element) per system manager
• Differences between online and offline systems:
  – Wide variety of equipment used as a single system
  – Various PCs with different configurations (trigger farms, dataflow, control, monitoring, file servers)
  – Variety compounded by staged procurements
  – Very large and highly loaded network (e.g. event building)
  – Failure of any part of the online system will either reduce data-taking efficiency (e.g. loss of an HLT sub-farm) or interrupt data taking (failure of the central controller), i.e. we have to run a complete, coherent system
• Dedicated team with appropriate skills needed to ensure reliability and optimal capacity of the online systems
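The boxes-per-manager guideline amounts to a simple ratio. The sketch below shows one way to express it in Python. The inventory and the boxes-per-manager figure are hypothetical placeholders chosen for illustration; they are not numbers from the LCG guidelines or from the experiments.

```python
def fte_needed(box_counts, boxes_per_manager):
    """System-manager effort (in FTEs) for a fixed boxes-per-manager ratio."""
    if boxes_per_manager <= 0:
        raise ValueError("boxes_per_manager must be positive")
    total_boxes = sum(box_counts.values())
    return total_boxes / boxes_per_manager


# Hypothetical inventory, for illustration only.
example_inventory = {"pc": 900, "network_switch": 60, "storage_element": 40}

# Hypothetical ratio: one system manager per 200 boxes.
print(fte_needed(example_inventory, boxes_per_manager=200))
```

A pure ratio like this treats every box alike, so it does not capture the online-specific factors listed above (heterogeneous configurations, staged procurements, the need to run one coherent system).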
System management effort (2)
• Manpower from collaboration?
  – LHC collaborations are very large, but attempts to find suitably qualified effort for system management have failed to meet even today’s needs
  – Most people (physicists, engineers) do not have the right profile
  – Institutes that have people with the proper qualifications are not prepared to locate them at CERN for adequate periods
• Full operation
  – 24/7 cover at Level-1, normal working hours at Level-2 + service piquet
  – At least 5 people Level-1 and 5 people Level-2. Reduced by some overlap
  – Shift crew will contribute to Level-1
• Provisional estimates to be adapted (2008-9) following experience of running the system and a better knowledge of the system reliability
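The figure of at least 5 people for 24/7 Level-1 cover is consistent with simple shift arithmetic. The sketch below assumes a 40-hour working week and one person on duty at a time. Both are assumptions made here for illustration and are not stated in the slides.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours of cover per week for 24/7 operation


def people_for_continuous_cover(weekly_hours_per_person=40.0, availability=1.0):
    """Minimum head count so that one person is on duty at all times.

    weekly_hours_per_person and availability (the fraction of working time
    a person is actually present) are assumed values, not from the slides.
    """
    effective_hours = weekly_hours_per_person * availability
    return math.ceil(HOURS_PER_WEEK / effective_hours)


# 168 / 40 = 4.2, which rounds up to 5 people.
print(people_for_continuous_cover())
```

Under these assumptions the minimum is 5, and any allowance for leave or illness pushes the requirement up, which is why overlap between levels and help from the shift crew matter.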
System management effort (3)
Total effort in FTEs (Level-1 and Level-2 + Supervisor)
Experiment | 2005 | 2006 | 2007 | 2008 | 2009
ALICE | 2 (1+1) | 3 (2+1) | 4 (3+1) | 5 (4+1) | 5 (4+1)
ATLAS | 2 (1+1) | 3 (2+1) | 5 (4+1) | 8 (6+2) | 9 (7+2)
CMS | 1.5 (1+0.5) | 3 (2+1) | 7 (5+2) | 8 (6+2) | 10 (8+2)
LHCb | 2 (1.5+0.5) | 3 (2.5+0.5) | 5 (4+1) | 6 (5+1) | 6 (5+1)
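To see the combined request across the four experiments, the per-experiment totals can be summed by year. The short Python sketch below uses only the totals listed in the table; the variable and function names are illustrative.

```python
# Total FTEs per experiment for each year, copied from the table.
YEARS = [2005, 2006, 2007, 2008, 2009]
EFFORT_FTE = {
    "ALICE": [2, 3, 4, 5, 5],
    "ATLAS": [2, 3, 5, 8, 9],
    "CMS": [1.5, 3, 7, 8, 10],
    "LHCb": [2, 3, 5, 6, 6],
}


def totals_per_year(effort, years=YEARS):
    """Sum the FTE figures of all experiments for each year."""
    return {
        year: sum(row[i] for row in effort.values())
        for i, year in enumerate(years)
    }


for year, total in totals_per_year(EFFORT_FTE).items():
    print(year, total)
```

With these figures the four experiments together need 7.5 FTEs in 2005, rising to 30 FTEs in 2009.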
M&O budget category
• M&O A
• Request of CERN management and RRB
• No other identified source
Replacement of equipment (1)
• Equipment: PCs, network, and storage used for dataflow and online trigger
• Motivations:
  – Reliability of equipment as it ages
  – Maintainability after a few years (3-year warranty)
  – Suitability of old equipment to follow evolution of operating system and to work with new equipment
  – Need to follow Operating System (OS) evolution:
    • Security patches
    • New PCs (staged installation) not supported by old OS versions
    • Old OS versions not supported
    • Code will continue to be developed with dependencies on the OS and compiler versions
    • Online trigger code based on, or using, offline code developed for the current OS version
Replacement of equipment (2)
• Categories
  – Disk and fileservers: lower reliability and very rapid evolution. 3 years
  – PCs: 4 years
    • Replacement cost will not directly follow Moore’s Law: I/O performance limitations, new multi-core architecture might require major increase in system memory
  – Network
    • Central switch: 5 years (= period of maintenance by manufacturer)
    • Smaller peripheral switches: 4 years (shorter warranty but less critical)
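The replacement periods above translate directly into a rolling replacement schedule. The Python sketch below uses the periods from the slide. The purchase year and planning horizon are hypothetical, and the steady-state fraction assumes equipment bought in equal yearly batches, which the slides do not state.

```python
# Replacement periods in years, taken from the categories listed above.
REPLACEMENT_PERIOD_YEARS = {
    "disk_and_fileservers": 3,
    "pcs": 4,
    "central_switch": 5,
    "peripheral_switches": 4,
}


def replacement_years(purchase_year, period, horizon):
    """Years up to `horizon` (inclusive) in which a batch is due for replacement."""
    return list(range(purchase_year + period, horizon + 1, period))


def annual_replacement_fraction(period):
    """Steady-state share of the installed base replaced each year
    (assumes equal yearly purchase batches)."""
    return 1.0 / period


# Hypothetical example: a batch bought in 2006, planned up to 2016.
for category, period in REPLACEMENT_PERIOD_YEARS.items():
    print(category, replacement_years(2006, period, 2016),
          round(annual_replacement_fraction(period), 2))
```

The schedule gives timing only. As noted above, replacement cost will not directly follow Moore’s Law, so a budget estimate would need separate per-category cost assumptions.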
Previous practice
• LEP and fixed-target era:
  – Computers were complete systems qualified by a commercial company
  – Maintenance contract paid by experiments
  – System managers in experiments (some CERN staff)
  – CERN had operator staff in the computing center and in groups giving support to experiments

• LHC era:
  – Components tested, qualified and assembled into complete systems by the experiments
  – Overall system much larger and more complex than previously
  – Very few operators at CERN directly employed by CERN