
Page 1: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

AMOD Weekly report(Ale, Alexei, Jarka)

Doug Benjamin(AMOD shadow)

Page 2: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

Active week

[Plots: total jobs completed hourly (analysis/production), scale marks at 40K and 10K; DDM data transfers daily, scale marks at 700 TB and 100 TB; Tier 0 LSF running jobs, scale marks at 6k and 3k]

Page 3: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

Various ATLAS items

ATLAS Items to Follow up:

• FTS jobs not being dropped: add the conditions (error messages) in dq2.cfg – to be done
  o FTS returns exit code #1 if the command does not succeed

• SLS for SS becomes orange/red only if Restarts > 3 in 30 minutes
  o This cannot happen (because of the DQ2 loop). Cedric is aware.
  o Possible solutions: a) lower the threshold, e.g. 2 restarts, or b) increase the time window (see the threshold sketch after this list)

• Night shift Thursday/Friday: the shifter missed 2 big issues. To be followed up by the shift coordinators

• FR cloud SS: dq2 crashes and restarts frequently
  o Related to multi-hop subscriptions – experts pondering a solution

• AGIS ToACache being tested in SS FT, one deletion agent (atlddm17), and SS CERN (TW and IT): https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/36887

• ALARM (Wednesday): Tier 0 submission bsub slow. Actions (a grouping sketch follows this list):
  1. changing the network card of the LSF master (upgrade from 1 to 10 Gb)
  2. changing the way in which the CE queries the LSF master, i.e. grouping the queries from the CREAM CE; they think this can reduce the load quite a lot
  3. there is an open call to the LSF company (Platform, which is an IBM company)
  There is an expert looking at the problems right now.
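As a minimal sketch of the SLS threshold discussion above (the function name, the restart-timestamp input and the defaults are hypothetical, not actual DQ2 or SLS code), the two proposed fixes amount to changing one of two parameters:

    from datetime import datetime, timedelta

    def sls_colour(restart_times, window_minutes=30, max_restarts=3):
        """Return 'green' or 'red' from recent SS restart timestamps.

        Hypothetical illustration: the sensor turns red only when the
        number of restarts inside the window exceeds max_restarts.
        Option a) lowers max_restarts (e.g. to 2); option b) raises
        window_minutes so more restarts fall inside one window.
        """
        now = datetime.utcnow()
        window = timedelta(minutes=window_minutes)
        recent = [t for t in restart_times if now - t <= window]
        return "red" if len(recent) > max_restarts else "green"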
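For item 2 of the ALARM above, a rough illustration of what "grouping the queries" could look like, assuming the CE status checks boil down to bjobs calls against the LSF master (the helper names and job-ID list are made up; this is not the actual CREAM CE code):

    import subprocess

    def query_per_job(job_ids):
        # One bjobs round trip per job: N processes, N queries to the LSF master.
        return {j: subprocess.run(["bjobs", j], capture_output=True, text=True).stdout
                for j in job_ids}

    def query_grouped(job_ids):
        # A single bjobs invocation covering all jobs: one query to parse.
        return subprocess.run(["bjobs", *job_ids], capture_output=True, text=True).stdout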

Page 4: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

Never a dull moment

Monday:
o CVMFS cond db file issue: Doug and Misha were contacted in the morning; Misha on holidays, Doug flying to CERN. Doug fixed the problem at lunch time (not during the flight).
  • Instructions written for AMOD – shift coordinators need to make sure they are in the instructions

Tuesday:
o Many ATLAS Central Services (running on VMs) misbehaved, due to a glitch on the hypervisors: http://itssb.web.cern.ch/service-incident/cviclr04fc-cluster-failure/12-06-2012

o CERN-PROD ALARM: T0 merge files not accessible (GGUS:83178)
o FZK FTS server channel stuck (GGUS:83111). Site admin talked with FTS.

Wednesday:
o MC12*merge.TAG* replication issue: all the data were set to secondary. Immediate actions: change the metadata to primary and replicate all to CERN-PROD. Follow-up: make sure this won't happen again (DaTRI modified for group production to datadisk), discuss the number of replicas, etc. The loop still needs to be closed.

o RAL: attempted Oracle 11 upgrade caused a UK-wide FTS outage; FTS moved to an alternative database

Page 5: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

Never a dull moment (2)

Thursday:
o CERN-PROD LSF bsub slow – ALARM GGUS:83252
o RAL Oracle 11 upgrade did not make it; the site rolled back. To be re-scheduled. For jobs submitted before the start of the intervention, the FTS server was returning exit code 256, which is the same as "no contact to FTS server", so the ATLAS SS were polling all the already lost FTS jobs. Since this is definitely related to the aborted upgrade attempt, we do not report further (i.e. no GGUS).

o A growing number of sites using CVMFS are seeing the occasional error "[CVMFS sites] Error: cmtsite command was timed out": https://savannah.cern.ch/support/?129468

o There is an issue when a DDM SS box spends too much time scanning the FTS jobs. If the site is far away, the problem is reached with fewer FTS jobs: apparently around 400 FTS jobs for the TRIUMF FTS and 800-1000 for European T1s. DDM (Cedric) is trying to find the optimal way to avoid this (a rough back-of-the-envelope estimate follows below).
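A rough illustration of why the distance matters (the round-trip times and the per-job cost model below are assumptions for the sketch, not measured DDM numbers): if each FTS status query costs roughly one network round trip, the scan time grows with both the number of jobs and the latency to the FTS server, so a distant server hits the same wall-clock limit with fewer jobs.

    # Hypothetical cost model: scan_time ~ n_jobs * rtt_seconds.
    def scan_time(n_jobs, rtt_seconds):
        return n_jobs * rtt_seconds

    # Illustrative (assumed) round-trip times seen from a CERN-hosted SS box.
    print(scan_time(400, 0.15))   # distant T1 (e.g. TRIUMF): ~60 s
    print(scan_time(900, 0.03))   # European T1:              ~27 s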

Friday:
o CERN-PROD: overnight SRM transfer failures on CERN-PROD_TZERO; later in the morning stages were successful, ticket closed: GGUS:8329
o TRIUMF: FTS errors with proxy: GGUS:83293
o FR site services – (see page 3)
o Overnight power cut affected the FZK DDM SS box

Page 6: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

Finally the weekend

Saturday/Sunday:

o IN2P3 FTS channels got stuck. GGUS:83320 solved: some channel agents did not recover the Oracle connection after the logrotate at 4:00 AM, due to a problem with Oracle virtual IPs. Solved by defining a new connection string which does not use the Oracle virtual IPs (an illustrative sketch follows this list).

o TRIUMF: 1745 files lost. Files declared to the consistency service (Savannah:95440). The ticket will be updated when the exact number of lost files is confirmed.

o Extra T1-T1 subscriptions (some missing datasets)
o All CERN resources given to Tier 0 processing
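As an illustration of the IN2P3 fix above (the hostnames, port and service name are made up; the real IN2P3 connect string is not shown in this report), the change amounts to pointing the Oracle connect descriptor at the physical listener hosts instead of the virtual IPs:

    # Hypothetical original descriptor, using the Oracle virtual IP host.
    OLD_CONNECT_STRING = (
        "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=fts-vip.example.in2p3.fr)(PORT=1521))"
        "(CONNECT_DATA=(SERVICE_NAME=fts.example)))"
    )

    # Hypothetical replacement, listing the physical hosts directly so the
    # channel agents can reconnect even when the virtual IPs misbehave.
    NEW_CONNECT_STRING = (
        "(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=on)"
        "(ADDRESS=(PROTOCOL=TCP)(HOST=ftsdb1.example.in2p3.fr)(PORT=1521))"
        "(ADDRESS=(PROTOCOL=TCP)(HOST=ftsdb2.example.in2p3.fr)(PORT=1521)))"
        "(CONNECT_DATA=(SERVICE_NAME=fts.example)))"
    )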

Data quality monitoring issues:
o A critical directory fills up and our monitoring fails to catch it

Page 7: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

OWL Monitoring / shift operations

• Around 00:00 Saturday morning – Tier 0: the directory used for log file merging became full, stopping processing (an atypical occurrence)

• The browser used by the shifter at P1 crashed and restarted at ~12:30 – the normally updated plots were wedged and the warning icons were stuck green

• Lost Tier 0 processing for the night until the shift change
• The OWL shifter is a distributed computing expert
• Our shift monitoring is not really designed for middle-of-the-night operation, when people are most tired.

My suggestion – a review of all shift monitoring, with the assumption that shifters are tired: make it easier for them to spot serious errors (a minimal alert sketch follows).
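In the spirit of the suggestion above, a minimal sketch of the kind of check that would have caught the full log-merge directory (the path, the threshold and the alerting hook are hypothetical, not part of the existing Tier 0 monitoring):

    import shutil

    # Hypothetical path of the Tier 0 log-file merging area.
    MERGE_DIR = "/t0/logmerge"
    WARN_FRACTION = 0.90  # alert well before the directory is 100% full

    def check_merge_dir(path=MERGE_DIR, warn_fraction=WARN_FRACTION):
        usage = shutil.disk_usage(path)
        used_fraction = usage.used / usage.total
        if used_fraction >= warn_fraction:
            # Hook a loud alarm here (email, SMS, red SLS sensor, ...).
            print(f"ALARM: {path} is {used_fraction:.0%} full")
        return used_fraction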

Page 8: AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)

Many Thanks

• Senior AMODs (Ale, Jarka and Alexei)
  o Excellent trainers – they know how to crack the whip

• Comp@P1 shifters – keeping the data flowing during the week

• ADCoS shifters and experts
• Tier 0 experts – making the data flow
• Site admins – quickly recovering from the daily hiccups
• ADC experts