AMOD report 24 – 30 September 2012

www.egi.eu EGI-InSPIRE RI-261323

Fernando H. Barreiro Megino

CERN IT-ES

Workload

Data transfers

• > 1M files a day

• High number of transfer failures caused by a few NL T2s

Tue25 - High load on PanDA Servers

• Average time for DQ2+LFC registration increased dramatically, causing high load on the PanDA servers

• Some LFC timings in the logs indicated that the registration slowness was in DQ2

[Figure: number of sessions open on the ADCR3 instance, mostly by the ATLAS_LFC_W user. Labels: CC writer 1, CC writer 2]

Tue25 - High load on PanDA Servers

• Other observations that came up during the investigation
  • Some improvements on the LFC client are going to be discussed during the “DB technical meeting on the LFC” on Wednesday 3rd Oct

• PanDA server LFC registration should be activated for all sites in order to avoid individual registrations by the pilot

• aCT registers in bursts without bulk methods: in the LFC logs we saw 4k accesses over 1 hour and only 7 accesses over another hour

• There were 2 SS machines serving the DE cloud (i.e. the same sites twice) with similar configuration
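
As an illustration of the point about bulk methods, the sketch below groups individual registrations into fixed-size batches, so that a catalogue sees a few bulk calls instead of thousands of single ones. It is not the aCT, PanDA or LFC code: `chunked`, `register_in_bulk`, the batch size of 100 and the `bulk_register` callable are all hypothetical names chosen for this example.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def register_in_bulk(
    files: Sequence[str],
    bulk_register: Callable[[List[str]], None],
    batch_size: int = 100,
) -> int:
    """Send `files` to `bulk_register` in batches and return the number of calls.

    `bulk_register` is a hypothetical stand-in for whatever bulk method
    the catalogue client offers; it is an assumption of this sketch.
    """
    calls = 0
    for batch in chunked(files, batch_size):
        bulk_register(batch)
        calls += 1
    return calls


# Stand-in client that only records the batches it receives.
received: List[List[str]] = []
n_calls = register_in_bulk([f"file_{i}" for i in range(4000)], received.append)
```

Under these assumptions, 4000 registrations reach the catalogue as 40 calls rather than 4000 separate accesses.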

Thu27 - SS callbacks to dashboard piling up

SS-FR

• Initially we thought it was exclusively due to the CERN network intervention
  • After checking the logs we have seen slow callbacks before the intervention on different SS machines
  • D. Tuckett is checking the situation

Other incidents and downtimes

• Monday
  • New PanDA proxy had not been updated on PanDA Monitor machines (Savannah: 97737)
  • INFN-T1 scheduled downtime for ~1 hour

• Tuesday
  • RAL 6h upgrade to CASTOR 2.1.12-10. Alastair set UK cloud brokeroff on the previous evening

• Thursday
  • CERN network intervention to replace some switches. Services at risk were CASTOR, EOS, elog and dashboard. Smooth intervention - NTR.

• Friday
  • BNL to ASGC transfer errors. Being investigated by both sides during the weekend. ASGC FTS is blocked from accessing BNL SRM and the routing path is changed. (GGUS:86537)

Other incidents and downtimes (2)

• Sunday
  • PVSS DCS replication with large delays due to high insertion rate. DCS expert had to be called on Sunday
  • RAL had failing jobs due to put errors and transfer errors – including T0 export. Caused by problem with Stager databases and resolved during Sunday late evening (GGUS:86552)

• Saturday
  • SS-SARA had CRITICAL errors. MySQL DB corruption? Problem to be understood by DDM experts.

Acknowledgements

• Except for occasional highlights it has been a very quiet week

• Thanks a lot to
  • ADCoS expert & shifters, and to the Comp@P1 shifter for the good work
  • experts of the different components and sites for the quick reaction
  • Alessandro, Ueda for their support

Backup slides

NL transfer errors
