AMOD report 24 – 30 September 2012

www.egi.eu EGI-InSPIRE RI-261323


Fernando H. Barreiro Megino

CERN IT-ES



Workload



Data transfers

• High number of transfer failures caused by a few NL T2s
• > 1M files a day



Tue25 - High load on PanDA Servers

• Average time for DQ2+LFC registration increased dramatically, causing high load on PanDA Servers

• Some LFC timings in the logs indicated that the registration slowness was in DQ2

[Figure: number of sessions open on the ADCR3 instance, mostly by the ATLAS_LFC_W user. Panels: CC writer 1, CC writer 2]



Tue25 - High load on PanDA Servers

• Other observations that came up during the investigation
  • Some improvements to the LFC client are going to be discussed during the “DB technical meeting on the LFC” on Wednesday 3rd Oct
  • PanDA server LFC registration should be activated for all sites in order to avoid individual registrations by the pilot
  • aCT registers in bursts without bulk methods: in the LFC logs we saw 4k accesses over 1 hour and only 7 accesses over another hour
  • There were 2 SS machines serving the DE cloud (i.e. the same sites twice) with similar configuration
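The difference between individual registrations and bulk methods noted above can be sketched as simple client-side batching. This is only an illustration under stated assumptions: `chunked`, `register_in_bulk`, the `bulk_register` callable and the batch size of 500 are hypothetical names and values, not part of the LFC, DQ2, PanDA or aCT interfaces.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def register_in_bulk(
    files: Sequence[str],
    bulk_register: Callable[[Sequence[str]], None],
    batch_size: int = 500,  # hypothetical value, chosen for illustration only
) -> int:
    """Send `files` to a catalogue in batches and return the number of calls made.

    `bulk_register` is a hypothetical stand-in for whatever bulk method
    the catalogue client offers; it receives one batch per call.
    """
    calls = 0
    for batch in chunked(files, batch_size):
        bulk_register(batch)
        calls += 1
    return calls


if __name__ == "__main__":
    # Illustrative input only: 4000 made-up file names.
    names = [f"file{i}" for i in range(4000)]
    print(register_in_bulk(names, lambda batch: None, batch_size=500))
```

With these illustrative numbers, 4000 entries go out in 8 calls instead of 4000 individual ones; the actual gain depends on what the real client and server support.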



Thu27 - SS callbacks to dashboard piling up

[Figure: SS-FR]

• Initially we thought it was exclusively due to the CERN network intervention
• After checking the logs we have seen slow callbacks before the intervention on different SS machines
• D. Tuckett is checking the situation



Other incidents and downtimes

• Monday
  • New PanDA proxy had not been updated on PanDA Monitor machines (Savannah: 97737)
  • INFN-T1 scheduled downtime for ~1 hour
• Tuesday
  • RAL 6h upgrade to CASTOR 2.1.12-10. Alastair set the UK cloud to brokeroff on the previous evening
• Thursday
  • CERN network intervention to replace some switches. Services at risk were CASTOR, EOS, elog and dashboard. Smooth intervention - NTR.
• Friday
  • BNL to ASGC transfer errors. Being investigated by both sides during the weekend. ASGC FTS is blocked from accessing BNL SRM and the routing path is changed. (GGUS:86537)


https://savannah.cern.ch/bugs/?97737

https://ggus.eu/ws/ticket_info.php?ticket=86537


Other incidents and downtimes (2)

• Sunday
  • PVSS DCS replication with large delays due to high insertion rate. DCS expert had to be called on Sunday
  • RAL had failing jobs due to put errors and transfer errors, including T0 export. Caused by a problem with the Stager databases and resolved late on Sunday evening (GGUS:86552)
• Saturday
  • SS-SARA had CRITICAL errors. MySQL DB corruption? Problem to be understood by DDM experts.


https://ggus.eu/ws/ticket_info.php?ticket=86552


Acknowledgements

• Except for occasional highlights it has been a very quiet week

• Thanks a lot to
  • ADCoS experts and shifters, and to the Comp@P1 shifter for the good work
  • experts of the different components and sites for the quick reaction
  • Alessandro, Ueda for their support





Backup slides



NL transfer errors

