SRM CCRC-08 and Beyond

Shaun de Witt – CASTOR Face-to-Face

Introduction

Problems in 1.3-X, and what we are doing about them
Positives
Setups
Recommendations
Future Developments
Release Procedures

Problems - Database

Deadlocks
Observed at CERN and ASGC (CNAF too?)
Not at RAL – not sure why
Two types (loosely): daemon/daemon deadlocks and server/daemon deadlocks

Startup problems
Too many connections
ORA-0600 errors

Daemon/Daemon deadlocks

Found ‘accidentally’ at CERN
Caused by multiple back-ends running and talking to the same database
Leads to database deadlocks in the GC
In 2.7, the GC has moved into the database as a procedure
Could be ported to 1.3, but this is not planned

Server/Daemon deadlocks

Caused by using the CASTOR fillObj() API
When filling subrequests, multiple calls can lead to two threads blocking one another
The daemon and server both need to check status and possibly modify subrequest info
Proposed solution is to take a lock on the request
This would stop deadlocks, but could lead to lengthy locks

Problems - Database: Startup problems

Seen often at CNAF, infrequently at RAL.

TNS ‘no listener’ error – need to check the logs at startup
No solution at the moment; restarting cures the problem
Could add monitoring to watch for this error (see the sketch below)
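
The slides only call for monitoring in general terms, so what follows is a minimal sketch rather than the site's actual tooling: a small C helper that scans a log file for the ‘no listener’ TNS signature (ORA-12541 is the usual Oracle code for it, although the slides do not name a code) and could equally watch for the terminal CGSI-gSOAP errors described later. The log path and error strings are assumptions.

```c
/* Minimal log-watch sketch.  The log path and error strings are assumed,
 * not taken from the SRM release.  A real monitor would tail the file and
 * raise an alarm (or trigger a restart) rather than print to stdout.
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/var/log/srm/srm2.log";  /* assumed path */
    const char *patterns[] = { "TNS:no listener", "ORA-12541", "CGSI-gSOAP" };
    char line[4096];

    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        for (size_t i = 0; i < sizeof(patterns) / sizeof(patterns[0]); i++) {
            if (strstr(line, patterns[i]) != NULL) {
                printf("ALERT: %s", line);   /* flag the matching log line */
                break;
            }
        }
    }
    fclose(fp);
    return 0;
}
```

Run from cron, or wrapped around a tail of the live log, something of this shape would at least flag the error early enough for an operator to restart the affected daemon.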

Problems - Database: Too many connections

Seen at CERN
Partly down to configuration: many SRMs talking to the same database instance
Two solutions:
More database hardware – fewer SRMs on the same instance, but expensive
Reduce threads on the server and daemon – may cause TCP timeout errors under load (server) or cause put/get requests to be processed too slowly (daemon)
More on configuration later

Problems - Database: ORA-0600 errors

ORA-0600 (internal error) problems seen at RAL and CERN
An Oracle internal error; will render the SRM useless
Fix available from Oracle; RAL has not seen it since applying the fix
Gordon Brown at RAL can provide details

Problems - Network

Intermittent CGSI errors
Terminal CGSI errors
SRM ‘lock-ups’

Problems - Network: Intermittent CGSI-gSOAP errors

CGSI-gSOAP errors reported in the logs and to the client
Seen 2-10 times per hour (at RAL)
Correlation in time between front-ends: both will get an error at about the same time
Cause is unclear; no solution at the moment
Seems to happen on < 0.1% of requests at RAL

Problems - Network: Terminal CGSI-gSOAP errors

All threads end up returning CGSI-gSOAP errors
Can affect only one of the front-ends
Cause unknown; does not seem correlated with load or request type
No solution at the moment
ASGC site report indicated it may be correlated with database deadlocks (?)
Need monitoring to detect this in the log file (the log scan sketched earlier would apply here too)
Restart of the affected front-end normally clears the problem
New version of the CGSI plug-in available, but not yet tested

Problems - Network: SRM becomes unresponsive

Debugging indicates all threads stuck in recv(); cause unknown
May have been the cause of the ATLAS ‘blackouts’ during the first CCRC
New releases include recv() and send() timeouts, which should stop this
Two new configurable parameters in srm2.conf (see the sketch below)
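
The slides name the two new srm2.conf parameters but not the mechanism. As a rough sketch of what recv()/send() timeouts usually mean for a gSOAP-based server: the gSOAP context exposes recv_timeout and send_timeout fields (positive values are seconds). The port, backlog and timeout values below are illustrative, and soap_serve() comes from the soapcpp2-generated skeleton, so this is a shape rather than the actual SRM server loop.

```c
/* Sketch of a gSOAP server loop with recv()/send() timeouts applied.
 * Values are illustrative; the real server would read them from srm2.conf
 * (SOAPRECVTIMEOUT / SOAPSENDTIMEOUT in the later configuration slide).
 */
#include "stdsoap2.h"   /* gSOAP runtime */

int run_server(int port, int recv_timeout, int send_timeout)
{
    struct soap soap;
    soap_init(&soap);
    soap.recv_timeout = recv_timeout;   /* seconds before a blocked recv() gives up */
    soap.send_timeout = send_timeout;   /* seconds before a blocked send() gives up */

    if (!soap_valid_socket(soap_bind(&soap, NULL, port, 100))) {  /* 100 = listen backlog */
        soap_print_fault(&soap, stderr);
        soap_done(&soap);
        return -1;
    }
    for (;;) {
        if (!soap_valid_socket(soap_accept(&soap))) {
            soap_print_fault(&soap, stderr);  /* timed out or failed: keep serving */
            continue;
        }
        soap_serve(&soap);     /* dispatch to the soapcpp2-generated handlers */
        soap_destroy(&soap);   /* free deserialised data */
        soap_end(&soap);       /* free temporary heap data */
    }
}
```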

Problems - Other

Interactions with CASTOR:
Behaviour when CASTOR is slow
Needless RFIO calls loading job slots
Bulk removal requests
Use of the MSG field in DLF

Problems - Other: Behaviour when CASTOR becomes slow

See the error “Too many threads busy with CASTOR”
Can block new requests coming in, but a useful diagnostic of CASTOR problems
Solution is to decrease STAGERTIMEOUT in srm2.conf
The default of 900 secs is too long; most clients give up after 180 secs
No ‘hard and fast’ rule about what it should be; somewhere between 60 and 180 is the best guess

Pin time
Implementation ‘miscommunication’ – too heavy a weight applied
Fixed in 1.3-27
Also reduce the Pin Lifetime in srm2.conf

Problems - Other: Needless RFIO calls

Identified by CERN
Takes up job slots on CASTOR; times out after 60 seconds
Happens on all GETs without a space token
Introduced when support for multiple default spaces was added
Fix already in CVS, for release 2.7; it duplicates the code path used when a space token is provided
Could be backported to 1.3

Problems - Other: Bulk removal requests

Sometimes produce CGSI-gSOAP errors for large numbers of files (>50)
But the deletion does work – a problem on send()?
May be load related: on one day 4/6 tests with 100 files produced this error; the next day 0/6 tests with 1000 files produced it
Some discussion about removing the stager_rm call and just doing nsrm
May help speed up processing, but would leave more work for the CASTOR cleaning daemon

Problems - Other

Lots of MSG fields left blank – a problem for monitoring
Addressed in 2.7; will not be backported

Occasional crashes traced to the use of strtok (not strtok_r); fixed in 1.3-27 (see the sketch below)
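
For context only (this is not the SRM's own code, which the slides do not show): strtok() keeps its parsing position in hidden static state shared by every thread, so two threads tokenising at the same time can corrupt each other's state, which fits the occasional crashes described above. strtok_r() takes that state as an explicit per-caller argument, e.g.:

```c
/* Illustration of the strtok vs strtok_r issue; the SURL and delimiter
 * are made up for the example. */
#include <stdio.h>
#include <string.h>

static void parse_path(const char *surl)
{
    char buf[256];
    char *saveptr = NULL;   /* per-call parser state, so threads cannot collide */
    char *tok;

    strncpy(buf, surl, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    /* strtok(buf, "/") would keep its position in static state shared by
     * all threads; strtok_r keeps it in 'saveptr' instead. */
    for (tok = strtok_r(buf, "/", &saveptr); tok != NULL; tok = strtok_r(NULL, "/", &saveptr))
        printf("token: %s\n", tok);
}

int main(void)
{
    parse_path("srm://example.host:8443/castor/example/path");  /* hypothetical SURL */
    return 0;
}
```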

Positives: Request rate

At RAL, on one CMS front-end with 50 threads: 21K requests/hr
Distribution of the types of request not known

Processing speed
Again using CMS at RAL, with the daemon running 10/5 threads
PUT requests in 1-5 seconds; the same for GET requests without tape recall

Positives

Front-end quite stable
At RAL, few interventions required

SETUPS

Different sites have different hardware setups
Hope you can fill the gaps…!

RAL Setup

(Diagram: SRM-ATLAS, SRM-CMS, SRM-LHCb and SRM-ALICE front-ends sharing a 3-node RAC database)

CERN Setup

(Diagram: srm-cms, srm-atlas, srm-lhcb, srm-alice, srm-dteam and srm-ops front-ends; databases shared-db, atlas-db and lhcb-db on a single machine)

CNAF Setup

(Diagram: srm-cms and srm-shared front-ends; cms-db and shared-db databases on a single machine)

ASGC Setup

(Diagram: a single srm front-end; srm-db, castor-db and dlf-db on a 3-node RAC)

Useful Configuration Parameters

Based on your setup, you will need to tune some or all of the following parameters:

SERVERTHREADS, CASTORTHREADS, REQTHREADS, POLLTHREADS, COPYTHREADS
The more SRM instances on a single database instance, the fewer threads should be assigned to each SRM
Need to balance request and processing rates on the daemon and server

SOAPBACKLOG, SOAPRECVTIMEOUT, SOAPSENDTIMEOUT
Number of queued SOAP requests, and timeouts related to recv() and send()
Best ‘guesstimates’ for these are 100, 60 and 60

TIMEOUT
Stager timeout in castor.conf; best ‘guesstimate’ 60-180 seconds

PINTIME
Keep low
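
Purely to give these numbers a shape: the slides quote parameter names and ‘guesstimates’ but not the file syntax, so the key/value layout below is an assumption, and the thread counts are only loosely based on the RAL figures quoted earlier, not recommendations.

```
# Illustrative srm2.conf-style fragment – layout assumed, values are examples only
SERVERTHREADS    50     # front-end threads (RAL ran 50 on its CMS front-end)
CASTORTHREADS    10     # daemon threads (the RAL daemon ran 10/5 threads)
REQTHREADS       5
# POLLTHREADS and COPYTHREADS are similarly site-dependent
SOAPBACKLOG      100    # 'guesstimate' from the slides
SOAPRECVTIMEOUT  60     # seconds
SOAPSENDTIMEOUT  60     # seconds
STAGERTIMEOUT    120    # somewhere between 60 and 180 seconds
PINTIME          600    # keep low – value here is arbitrary
```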

Future Developments

Move to SL4
Move to CASTOR clients 2.1.7
New MoU

Move to SLC4

URGENT
No support for SLC3; support effort for SL3 dwindling
Have built and tested one version in the 1.3 series
All new development (2.7-X) on SL4; no new development in the 1.3 series

Move to 2.1.7 clients

URGENT
Addresses a security vulnerability with regard to proxy certificates
Much better error messaging; fewer ‘unknown error’ messages

2.1.3 clients no longer supported or developed

Since this requires a schema change, releases in this series will be 2.7-X

New MoU

Major new features:

srmPurgeFromSpace
Used to remove disk copies from a space
Initial implementation will only remove files that are currently also on tape

VOMS-based security
This will be implemented in CASTOR but may need changes to the SRM/CASTOR interface

Future Development Summary

New features will be put into 2.7-X or later releases
2.7-X releases only on SLC4
Is a port of 1.3-X to SLC4 required? Especially given the security hole in 1.3
Will require 2.1.7 clients installed on the SRM nodes
Timescale? End of June. A tall order!

Release Procedures

Following problems just after CCRC:
The SRM seemed to pass all tests, but the daemon failed immediately in production (CERN and RAL)
Brought about by a ‘simple’ change which only affected recalls when no space token was passed
Clear need for additional tests before release; the public s2 suite is not enough

Pre-Release Procedures

(Re)developing a shell test tool which will be delivered with the SRM
To include basic tests of all SRM functions
Will include testing of tape recalls if possible (i.e. not if only using a Disk1Tape0 system)
New tests added when we find missing cases
Will require the tester to have a certificate (i.e. cannot be run as root)
Looking at running the FULL s2 test suite
This includes tests of a number of invalid requests
Not normally run since it is VERY time consuming

Pre-Release Procedures

As now, s2 tests will be run over one week to try to ensure stability
The remaining problem is stress testing: no dedicated stress tests exist
But this is what is most likely to catch database problems
Could develop simple ones, but would they be realistic enough?