Draft of talk to be given in Madrid: CSC Operations Summary
Greg RaknessUniversity of California, Los Angeles
CMS Run Coordination WorkshopCIEMAT, Madrid
3 Nov 2011
https://indico.cern.ch/conferenceDisplay.py?confId=156187
CSC downtime at the beginning and end of 2011 running…
• CSC downtime was reduced after the fix of the CSC response to Resync
  – Since 21 July, CSC no longer has data-corruption problems
• Since 21 July, the biggest cause of CSC downtime is SEUs
  – Currently an SEU puts CSC into TTS = Out-of-sync…
  – …fixed by the DAQ shifter pushing TTCHardReset
• CSC to do: make the correct request for SEU recovery
  – Should go to TTS = Error
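The recovery logic above can be sketched as a small state transition. This is a hypothetical illustration only: the TTS state names come from the slide, but the function names and the recovery model are invented, not CSC online software.

```python
# Hypothetical sketch of the SEU-recovery behavior described above.
# TTS state names follow the slide; everything else is invented.

OUT_OF_SYNC = "Out-of-sync"
ERROR = "Error"
READY = "Ready"

def tts_state_after_seu(correct_request: bool) -> str:
    """Current behavior: an SEU drives CSC to TTS=Out-of-sync, which the
    DAQ shifter clears by pushing TTCHardReset. The planned fix is to
    request TTS=Error instead, signaling that a reset is required."""
    return ERROR if correct_request else OUT_OF_SYNC

def recover(state: str) -> str:
    # In this sketch, either bad state is cleared by a TTCHardReset.
    if state in (OUT_OF_SYNC, ERROR):
        return READY
    return state

print(recover(tts_state_after_seu(correct_request=False)))  # Ready
```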
3 Nov 2011 G. Rakness (UCLA) 2
11 March – 21 July: int. lumi = 1.3/fb… CSC downtime = 13/nb
21 July – 30 Oct: int. lumi = 3.9/fb… CSC downtime = 7/nb
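The quoted numbers can be turned into fractional luminosity losses (1 fb⁻¹ = 10⁶ nb⁻¹), which makes the improvement between the two periods explicit:

```python
# Fractional luminosity lost to CSC downtime, for the two periods above.
NB_PER_FB = 1e6  # 1 fb^-1 = 1e6 nb^-1

def downtime_fraction(lost_nb: float, total_fb: float) -> float:
    return lost_nb / (total_fb * NB_PER_FB)

early = downtime_fraction(13, 1.3)  # 11 March - 21 July
late = downtime_fraction(7, 3.9)    # 21 July - 30 Oct

print(f"{early:.2e} {late:.2e}")  # 1.00e-05 1.79e-06
```

So the fractional loss dropped by roughly a factor of five after the 21 July fix.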
Preparation for high data rates
• CSC data event size
  – With ALCT zero-suppression, the CSC event size would be reduced by ~15–20%
    • Requires a firmware update
  – Under discussion to implement during the Year-End Technical Stop
    • https://hypernews.cern.ch/HyperNews/CMS/get/csc-ops/2057.html
  – If CSC needs still more bandwidth, the number of CSC SLinks can be increased to 16 or 36
• HV trip threshold to be increased in the inner rings
  – See next slide for more on HV…
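As a toy illustration of why zero-suppression shrinks the event size: channels whose samples are all zero carry no hit information and can be dropped from the payload. The real ALCT/CFEB raw-data format is not modeled here; this is only the principle.

```python
# Toy zero-suppression sketch (the actual ALCT firmware format is not
# modeled): drop channels whose samples are all zero.

def zero_suppress(channels: dict) -> dict:
    """Keep only channels with at least one nonzero sample."""
    return {ch: samples for ch, samples in channels.items()
            if any(s != 0 for s in samples)}

event = {0: [0, 0, 0], 1: [0, 3, 1], 2: [0, 0, 0], 3: [5, 0, 0]}
suppressed = zero_suppress(event)
print(sorted(suppressed))  # [1, 3] -- only channels with hits survive
```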
Automation plan: CSC Expert System
• Goal: reduce the load on the CSC DOC by using the CSC Expert System
• First successful implementation: recovery from HV channel trips
  – 2010: HV trips had to be reset by the DCS shifter
  – 2011: HV trips handled automatically by the Expert System
• Next goal: automated firmware maintenance
  – Recall daily task: the CSC DOC reloads the program on CFEBs and ALCTs whose EEPROMs have flipped a bit
    • Problem discovered in 2008
    • Recovery procedure by the DOC has been reduced to a few clicks…
  – Example numbers, Aug 2010 – Aug 2011:
    • ~33% of ALCTs needed to be touched (~150 ALCTs)
    • ~13% of CFEBs needed to be touched (~300 CFEBs)
    • See https://indico.cern.ch/conferenceDisplay.py?confId=160517
  – Timescale: during the 2012 run
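The planned automation amounts to the check the DOC does by hand today: compare each board's firmware image (e.g. a checksum) against the expected value and reload on mismatch. A minimal sketch, with invented board names, checksums, and data structures — none of this is actual CSC online software:

```python
# Hypothetical sketch of the planned EEPROM-reload automation: flag
# boards whose firmware checksum no longer matches the expected value
# (a flipped bit), so they can be reloaded. All names are invented.

EXPECTED = {"ALCT-ME+1/1/05": 0xBEEF, "CFEB-ME+2/1/12-3": 0xCAFE}

def boards_needing_reload(readback: dict) -> list:
    """Return boards whose read-back checksum differs from the expected one."""
    return [board for board, csum in readback.items()
            if EXPECTED.get(board, csum) != csum]

readback = {"ALCT-ME+1/1/05": 0xBEEF,    # healthy
            "CFEB-ME+2/1/12-3": 0xCAFF}  # one flipped bit
print(boards_needing_reload(readback))  # ['CFEB-ME+2/1/12-3']
```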
Fraction of active CSC channels

[Figure: fraction of active CSC channels vs. time. Annotations: "HV distribution board died"; "D-link board failure caused loss of the configurations of a few chambers"]

Diligence will be needed to preserve the high fraction of active CSC channels throughout 2012
A few items with impact beyond CSC
Question: if a subdetector has data corruption (e.g., sync-lost draining), is it possible to get a dump of the offending event?
Aging online computers
• All the VME control computers in CMS are more than 4 years old
  – Expect them to die at any moment…
• To facilitate reinstallation of the online software, CSC installs its RPMs with quattor
“You’re calling the wrong DOC”
• The following scenario has happened twice: a calorimeter is misconfigured, causing…
  – L1A rate = 105 kHz (pre-deadtime rate = 700 kHz)
  – CSCTF in 80% Warning (nobody can run at 700 kHz)
  – Shift crew calls CSC…
• The status values in this scenario:
  – L1 Trigger rate = 105 kHz
  – Pre-deadtime rate = 700 kHz
  – Deadtime = 80%
  – DAQ backpressure = False
• It might be worth making one screen that has this information (green if OK, red if bad)
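The suggested single screen can be sketched as one check per quantity with a green/red verdict. The thresholds below are illustrative guesses, not CMS operating limits:

```python
# Sketch of the suggested combined status screen: one check per
# quantity, GREEN if OK, RED if bad. Thresholds are illustrative only.

def status_screen(l1_rate_khz, pre_deadtime_khz, deadtime_pct, backpressure):
    checks = {
        "L1 Trigger rate":   l1_rate_khz <= 100,      # kHz (assumed cap)
        "Pre-deadtime rate": pre_deadtime_khz <= 100,  # kHz (assumed cap)
        "Deadtime":          deadtime_pct < 5,         # percent (assumed)
        "DAQ backpressure":  not backpressure,
    }
    for name, ok in checks.items():
        print(f"{name:18s} {'GREEN' if ok else 'RED'}")
    return all(checks.values())

# The misconfigured-calorimeter scenario from the slide: three of the
# four lines go RED, pointing away from CSC as the culprit.
status_screen(105, 700, 80, False)
```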
Monitor timing more regularly
• CSC = average wire time per chamber, averaged over all chambers
• ECAL = average EB time per crystal, averaged over all crystals

[Figure: average time (nsec) vs. time; curves: CSC, and ECAL – 0.325 ns. CSC and ECAL see similar changes in timing]

CSC needs to automate the monitoring of our timing…
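Note that the CSC quantity is a mean of per-chamber means, not a mean over all wires pooled together — chambers with many hits do not dominate. A minimal sketch of the distinction, with invented chamber labels:

```python
# The "CSC" timing quantity above: average wire time per chamber,
# then averaged over all chambers (a mean of per-chamber means).

def average_time(times_per_chamber: dict) -> float:
    per_chamber = [sum(t) / len(t) for t in times_per_chamber.values()]
    return sum(per_chamber) / len(per_chamber)

hits = {"ME+1/1/05": [8.0, 10.0], "ME-2/1/12": [12.0]}
print(average_time(hits))  # 10.5 = (9.0 + 12.0) / 2, not (8+10+12)/3
```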
Summary of CSC Operations
• Highlights from 2011 running include…
  – High fraction of live channels
  – Small amount of deadtime
  – Automatic HV trip recovery
• Preparations for 2012 running include…
  – Correct request for SEU recovery
  – Reduction of event size
  – Automation of EEPROM reload
• Some “CSC worries” which are “CMS worries,” too…
  – Diagnosing (rare) data corruption
  – Age of the VME control computers
  – Calling the right DOC at the right time
  – Monitoring of timing
Backup
Goals for workshop
• Review the experience from 2011 proton-running operations and plan for operations in 2012
• We are asking speakers to comment on the following topics (as relevant):
  – Experience with beam operation in 2011
  – Summary of major data-loss sources
  – Readiness for high pile-up, ~25 ns bunch spacing, and 6e33 luminosity
  – Readiness for 25 ns
  – Requests for CMS re-commissioning before LHC start-up, e.g., cosmics
  – Plans for 2012 operation
  – Availability of experts
  – Are you affected by SEUs? If so, how, and what is required to recover?
  – Experience with shifts and on-call load
  – Are you using standard tools (quattor) for software maintenance?