DESCRIPTION
AMOD report 24 – 30 September 2012. Fernando H. Barreiro Megino, CERN IT-ES. Workload. Data transfers. > 1M files a day. High number of transfer failures caused by a few NL T2s. Tue25 - High load on PanDA Servers.
www.egi.eu EGI-InSPIRE RI-261323
EGI-InSPIRE
AMOD report 24 – 30 September 2012
Fernando H. Barreiro Megino
CERN IT-ES
Workload
Data transfers
• > 1M files a day
• High number of transfer failures caused by a few NL T2s
Tue25 - High load on PanDA Servers
• Average time for DQ2+LFC registration increased dramatically, causing high load on PanDA Servers
• Some LFC timings in the logs indicated that the registration slowness was in DQ2
CC writer 1
CC writer 2
Number of sessions open on ADCR3 instance, mostly by the ATLAS_LFC_W user
Tue25 - High load on PanDA Servers
• Other observations that came up during the investigation
  • Some improvements on the LFC client are going to be discussed during “DB technical meeting on the LFC” on Wednesday 3rd Oct
  • PanDA server LFC registration should be activated for all sites in order to avoid individual registrations by the pilot
  • aCT registers in bursts without bulk methods: in the LFC logs we saw 4k accesses over 1 hour and only 7 accesses over another hour
  • There were 2 SS machines serving the DE cloud (i.e. the same sites twice) with similar configuration
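The burst pattern noted for aCT (4k accesses in one hour, 7 in another) is the kind of thing an hourly count over log timestamps makes visible. The following is a minimal sketch, assuming timestamps of the form `YYYY-MM-DD HH:MM:SS`; the real LFC log format is not given in these slides, and the function names and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime

def accesses_per_hour(timestamps):
    """Count accesses per clock hour from timestamp strings.

    Assumes the hypothetical format 'YYYY-MM-DD HH:MM:SS'.
    """
    counts = Counter()
    for ts in timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # Truncate to the start of the hour so all accesses in it share a key.
        counts[t.replace(minute=0, second=0)] += 1
    return counts

def burst_ratio(counts):
    """Ratio between the busiest and the quietest observed hour."""
    return max(counts.values()) / min(counts.values())

# Illustrative timestamps only, not taken from the real LFC logs.
sample = [
    "2012-09-25 10:05:00",
    "2012-09-25 10:06:30",
    "2012-09-25 10:59:59",
    "2012-09-25 11:15:00",
]
counts = accesses_per_hour(sample)
```

A large ratio between hours would point to bursty, non-bulk registration of the kind described above; hours with no accesses at all do not appear in the counts and would need separate handling.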
Thu27 - SS callbacks to dashboard piling up
SS-FR
• Initially we thought it was exclusively due to the CERN network intervention
• After checking the logs we have seen slow callbacks before the intervention on different SS machines
• D. Tuckett is checking the situation
Other incidents and downtimes
• Monday
  • New PanDA proxy had not been updated on PanDA Monitor machines (Savannah: 97737)
  • INFN-T1 scheduled downtime for ~1 hour
• Tuesday
  • RAL 6h upgrade to CASTOR 2.1.12-10. Alastair set the UK cloud to brokeroff on the previous evening
• Thursday
  • CERN network intervention to replace some switches. Services at risk were CASTOR, EOS, elog and dashboard. Smooth intervention - NTR.
• Friday
  • BNL to ASGC transfer errors. Being investigated by both sides during the weekend. ASGC FTS is blocked from accessing BNL SRM and routing path is changed. (GGUS:86537)
Other incidents and downtimes (2)
• Sunday
  • PVSS DCS replication with large delays due to high insertion rate. DCS expert had to be called on Sunday
  • RAL had failing jobs due to put errors and transfer errors, including T0 export. Caused by a problem with the Stager databases and resolved late on Sunday evening (GGUS:86552)
• Saturday
  • SS-SARA had CRITICAL errors. MySQL DB corruption? Problem to be understood by DDM experts.
Acknowledgements
• Except for occasional highlights it has been a very quiet week
• Thanks a lot to
  • ADCoS experts & shifters, and to the Comp@P1 shifter for the good work
  • experts of the different components and sites for the quick reaction
  • Alessandro, Ueda for their support
Backup slides
NL transfer errors