CMS Issues
Background – RAL Infrastructure
[Diagram: core CASTOR services (nsd, xrd-mgr, TM, rhd, stagerd, TGW, cupv, vmgr, vdqm) above a common layer, the instance headnodes, and diskservers (x20) running CASTOR 2.1.14-15 with XROOT 3.3.3-1]
Background – xroot infrastructure
[Diagram: diskservers (x20) behind the Xroot manager (3.3.3-1) and an Xroot redirector (4.X); the redirector reports up through the European redirectors to the Global redirectors; local WNs access directly, the Grid via the hierarchy]
The Problem…s
• Pileup workflow
– Local jobs had a 95% failure rate
– Jobs that managed to run had only 30% efficiency
• AAA failure
– Despite being the second site to integrate into AAA
– 100% failure for periods of 30 minutes to several days
Tackling the Problems
Pileup Broken Down
• Data accessed through xroot
• >95% of data at RAL
• Two problems in one
– Slow opening times (15 -> 600 secs; timing sketch below)
– Slow transfer rates
– 100% CPU WIO
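Not from the talk, but as a hedged illustration of how the open-time symptom can be measured: a minimal sketch using the pyxrootd client bindings (assumed installed; the URL is a placeholder).

    import time
    from XRootD import client  # pyxrootd bindings (assumed available)

    URL = "root://redirector.example//store/somefile.root"  # placeholder

    def time_open(url):
        # Time a single xroot open; the pileup symptom was opens
        # taking 15-600 s instead of the usual few seconds.
        f = client.File()
        start = time.time()
        status, _ = f.open(url)
        elapsed = time.time() - start
        if status.ok:
            f.close()
        return status.ok, elapsed

    if __name__ == "__main__":
        ok, secs = time_open(URL)
        print(f"open {'ok' if ok else 'FAILED'} in {secs:.1f}s")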
Slow Opening Times
• No obvious place
– Delays at all phases
– Almost all DB time spent in SubRequestToDo
Solution 1 (aka The Go Faster Stripes Solution)
Database Surgery
• DBMS_ALERT suspected of adding to delays under load
– Modified DB code to sleep for 50 ms (limiting the subReqToDo polling rate to ~20/sec; sketch below)
• Tested on preprod (functionally)
– Improved open time from 3-15 secs to 0-5 secs
• Deployed on all instances
• Made NO difference to the CMS problem
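The real change was in CASTOR's Oracle code; purely as a sketch of the idea, with all names hypothetical, a sleep-based cap on a DB polling loop looks like this:

    import time

    POLL_INTERVAL = 0.05  # a 50 ms sleep caps the loop at ~20 polls/sec

    def poll_sub_requests(fetch_batch, handle):
        # Hypothetical stand-in for the subReqToDo polling loop:
        # fetch_batch pulls pending sub-requests from the DB, handle
        # processes one. The sleep stops the loop hammering the
        # database when DBMS_ALERT wake-ups pile up under load.
        while True:
            for req in fetch_batch():
                handle(req)
            time.sleep(POLL_INTERVAL)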
Solution 2 (aka The Heart Bypass Solution)
Bypassing Scheduler
• Modified xroot to disable scheduling
• RISK
– Nothing restricting access to the disk servers
– ONLY applied to CMS
• RESULT
– Open times reduced to 1-30 seconds
– WIO still flatlining at 100%
• ‘SUCCESS’
Improving IO
• Difficult to test
– Could not generate the load artificially
– Needed a pileup workflow to be executing
• Testing on production ;)
• Did ‘the usual’
– Reduced allowed connections
– Throttled batch jobs (WIO sampling sketch below)
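The progress metric throughout is CPU WIO (I/O wait). Not from the talk, but a minimal sketch of sampling it from /proc/stat on Linux:

    import time

    def cpu_iowait_percent(interval=1.0):
        # Sample /proc/stat twice and return the % of CPU time spent
        # in iowait over the interval (field order per proc(5)).
        def snapshot():
            with open("/proc/stat") as f:
                fields = [int(x) for x in f.readline().split()[1:]]
            return fields[4], sum(fields)  # fields[4] is iowait
        io1, tot1 = snapshot()
        time.sleep(interval)
        io2, tot2 = snapshot()
        return 100.0 * (io2 - io1) / max(tot2 - tot1, 1)

    if __name__ == "__main__":
        print(f"iowait: {cpu_iowait_percent():.1f}%")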
Solution 3 (aka The Don’t Do This Solution)
• Change the UNIX I/O scheduler
– Now easy and can be done in-situ (sketch below)
• Four schedulers (plus options)
– cfq (default), anticipatory, deadline, noop
– Plus associated config
• Switched to noop
– WIO dropped to 60%
– Network rate increased 4x
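The in-situ switch is just a write to sysfs; a minimal sketch (needs root; the device name is an assumption):

    from pathlib import Path

    def set_io_scheduler(device="sda", scheduler="noop"):
        # Reading the sysfs file lists the available schedulers with
        # the active one in brackets, e.g. "noop anticipatory
        # deadline [cfq]"; writing a name switches immediately.
        path = Path(f"/sys/block/{device}/queue/scheduler")
        print("before:", path.read_text().strip())
        path.write_text(scheduler)
        print("after: ", path.read_text().strip())

    set_io_scheduler("sda", "noop")  # "sda" is an assumed device name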
XROOT Problems
Observations
• Random Failures (or more correctly random successes)
• Local access was OK (if slow – see previous)
• Lack of visibility up the hierarchy didn’t help – REALLY difficult to debug
Investigating the Problem
• Set up parallel infrastructure
– Replicated the manager, RAL redirector and European redirector
• Immediately saw the same issue…
Causes of Failure…
• Caching!
– cmsd and xrootd timed out at different times
– xrootd can return ENOENT, but if cmsd later gets a response, subsequent accesses work
– If cmsd doesn’t get a response, all future requests get ENOENT (toy model below)
• But why the slow response…?
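A toy model of that failure mode (not xrootd code; all names and the TTL are hypothetical): once a timed-out lookup is cached as a miss, every later request fails until the entry expires.

    import time

    NEG_TTL = 300  # hypothetical negative-cache lifetime, seconds

    class ToyRedirector:
        # Toy redirector that caches lookup results. If the upstream
        # query times out, the miss is cached as ENOENT, so every
        # later request for the path fails until the entry expires --
        # even though the file exists.
        def __init__(self, lookup):
            self.lookup = lookup  # upstream query; may raise TimeoutError
            self.cache = {}       # path -> (found?, expiry)

        def locate(self, path):
            hit = self.cache.get(path)
            if hit and hit[1] > time.time():
                return hit[0]
            try:
                found = self.lookup(path)
            except TimeoutError:
                found = False     # timeout recorded as "not found"
            self.cache[path] = (found, time.time() + NEG_TTL)
            return found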
Log Mining…
• Each log looked like performance was good
• Part of the problem
– Time resolution in xroot 3.3.X
– And logging generally
• Finally found delays in the ‘local’ nsd
– Processing time was good
– But delays in servicing requests (gap-finding sketch below)
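A hedged sketch of this kind of log mining: with xroot 3.3.X logging only to one-second resolution (the timestamp format here is an assumption), sub-second delays are invisible, so the best one can do is flag large gaps between consecutive entries.

    import re
    from datetime import datetime

    # "yymmdd hh:mm:ss" log prefix (format assumed)
    TS = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2})")

    def find_gaps(lines, threshold=5):
        # Yield (gap_seconds, line) wherever consecutive timestamped
        # log entries are more than `threshold` seconds apart.
        prev = None
        for line in lines:
            m = TS.match(line)
            if not m:
                continue
            t = datetime.strptime(m.group(1), "%y%m%d %H:%M:%S")
            if prev and (t - prev).total_seconds() > threshold:
                yield (t - prev).total_seconds(), line.rstrip()
            prev = t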
Solution – RAL Infrastructure
[Diagram: revised RAL setup – core CASTOR services (nsd, xrd-mgr, TM, rhd, stagerd, TGW) and diskservers (x20) now running CASTOR 2.1.14-15 with XROOT 3.3.6-1; a dedicated nsd/xrd-mgr feeds the Xroot redirector (4.X), which serves local WNs directly and remote WNs via the EU and Global redirectors on the Grid]