CMS Issues
Background – RAL Infrastructure
[Diagram: core CASTOR services (nsd, xrd-mgr, TM, rhd, stagerd, TGW, cupv, vmgr, vdqm) above a common layer, the instance headnodes, and diskservers (x20) running CASTOR 2.1.14-15 with XROOT 3.3.3-1]
Background – xroot infrastructure
[Diagram: diskservers (x20) behind the Xroot manager (3.3.3-1) and an Xroot redirector (4.X); the redirector reports up through the European redirectors to the Global redirectors; local WNs access directly, the Grid via the hierarchy]
The Problem…s
• Pileup workflow
– Local jobs had a 95% failure rate
– Jobs that managed to run had only 30% efficiency
• AAA failure
– Despite being the second site to integrate into AAA
– 100% failure for periods of 30 minutes to several days
Tackling the Problems
Pileup Broken Down
• Data accessed through xroot
• >95% of data at RAL
• Two problems in one
– Slow opening times (15 -> 600 secs; timing sketch below)
– Slow transfer rates
– 100% CPU WIO
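Not from the talk, but as a hedged illustration of how the open-time symptom can be measured: a minimal sketch using the pyxrootd client bindings (assumed installed; the URL is a placeholder).

    import time
    from XRootD import client  # pyxrootd bindings (assumed available)

    URL = "root://redirector.example//store/somefile.root"  # placeholder

    def time_open(url):
        # Time a single xroot open; the pileup symptom was opens
        # taking 15-600 s instead of the usual few seconds.
        f = client.File()
        start = time.time()
        status, _ = f.open(url)
        elapsed = time.time() - start
        if status.ok:
            f.close()
        return status.ok, elapsed

    if __name__ == "__main__":
        ok, secs = time_open(URL)
        print(f"open {'ok' if ok else 'FAILED'} in {secs:.1f}s")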
Slow Opening Times
• No obvious place
– Delays at all phases
– Almost all DB time spent in SubRequestToDo
Solution 1 (aka The Go Faster Stripes Solution)
Database Surgery
• DBMS_ALERT suspected of adding to delays under load
– Modified DB code to sleep for 50 ms (limiting the subReqToDo polling rate to ~20/sec; sketch below)
• Tested on preprod (functionally)
– Improved open time from 3-15 secs to 0-5 secs
• Deployed on all instances
• Made NO difference to the CMS problem
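The real change was in CASTOR's Oracle code; purely as a sketch of the idea, with all names hypothetical, a sleep-based cap on a DB polling loop looks like this:

    import time

    POLL_INTERVAL = 0.05  # a 50 ms sleep caps the loop at ~20 polls/sec

    def poll_sub_requests(fetch_batch, handle):
        # Hypothetical stand-in for the subReqToDo polling loop:
        # fetch_batch pulls pending sub-requests from the DB, handle
        # processes one. The sleep stops the loop hammering the
        # database when DBMS_ALERT wake-ups pile up under load.
        while True:
            for req in fetch_batch():
                handle(req)
            time.sleep(POLL_INTERVAL)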
Solution 2 (aka The Heart Bypass Solution)
Bypassing Scheduler
• Modified xroot to disable scheduling
• RISK
– Nothing restricting access to the disk servers
– ONLY applied to CMS
• RESULT
– Open times reduced to 1-30 seconds
– WIO still flatlining at 100%
• ‘SUCCESS’
Improving IO
• Difficult to test
– Could not generate the load artificially
– Needed a pileup workflow to be executing
• Testing on production ;)
• Did ‘the usual’
– Reduced allowed connections
– Throttled batch jobs (WIO sampling sketch below)
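The progress metric throughout is CPU WIO (I/O wait). Not from the talk, but a minimal sketch of sampling it from /proc/stat on Linux:

    import time

    def cpu_iowait_percent(interval=1.0):
        # Sample /proc/stat twice and return the % of CPU time spent
        # in iowait over the interval (field order per proc(5)).
        def snapshot():
            with open("/proc/stat") as f:
                fields = [int(x) for x in f.readline().split()[1:]]
            return fields[4], sum(fields)  # fields[4] is iowait
        io1, tot1 = snapshot()
        time.sleep(interval)
        io2, tot2 = snapshot()
        return 100.0 * (io2 - io1) / max(tot2 - tot1, 1)

    if __name__ == "__main__":
        print(f"iowait: {cpu_iowait_percent():.1f}%")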
Solution 3 (aka The Don’t Do This Solution)
• Change the UNIX I/O scheduler
– Now easy and can be done in-situ (sketch below)
• Four schedulers (plus options)
– cfq (default), anticipatory, deadline, noop
– Plus associated config
• Switched to noop
– WIO dropped to 60%
– Network rate increased 4x
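The in-situ switch is just a write to sysfs; a minimal sketch (needs root; the device name is an assumption):

    from pathlib import Path

    def set_io_scheduler(device="sda", scheduler="noop"):
        # Reading the sysfs file lists the available schedulers with
        # the active one in brackets, e.g. "noop anticipatory
        # deadline [cfq]"; writing a name switches immediately.
        path = Path(f"/sys/block/{device}/queue/scheduler")
        print("before:", path.read_text().strip())
        path.write_text(scheduler)
        print("after: ", path.read_text().strip())

    set_io_scheduler("sda", "noop")  # "sda" is an assumed device name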
XROOT Problems
Observations
• Random Failures (or more correctly random successes)
• Local access was OK (if slow – see previous)
• Lack of visibility up the hierarchy didn’t help – REALLY difficult to debug
Investigating the Problem
• Set up parallel infrastructure
– Replicated the manager, RAL redirector and European redirector
• Immediately saw the same issue…
Causes of Failure…
• Caching!
– cmsd and xrootd timed out at different times
– xrootd can return ENOENT, but if cmsd later gets a response, subsequent accesses work
– If cmsd doesn’t get a response, all future requests get ENOENT (toy model below)
• But why the slow response…?
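A toy model of that failure mode (not xrootd code; all names and the TTL are hypothetical): once a timed-out lookup is cached as a miss, every later request fails until the entry expires.

    import time

    NEG_TTL = 300  # hypothetical negative-cache lifetime, seconds

    class ToyRedirector:
        # Toy redirector that caches lookup results. If the upstream
        # query times out, the miss is cached as ENOENT, so every
        # later request for the path fails until the entry expires --
        # even though the file exists.
        def __init__(self, lookup):
            self.lookup = lookup  # upstream query; may raise TimeoutError
            self.cache = {}       # path -> (found?, expiry)

        def locate(self, path):
            hit = self.cache.get(path)
            if hit and hit[1] > time.time():
                return hit[0]
            try:
                found = self.lookup(path)
            except TimeoutError:
                found = False     # timeout recorded as "not found"
            self.cache[path] = (found, time.time() + NEG_TTL)
            return found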
Log Mining…
• Each log looked like performance was good
• Part of the problem
– Time resolution in xroot 3.3.X
– And logging generally
• Finally found delays in the ‘local’ nsd
– Processing time was good
– But delays in servicing requests (gap-finding sketch below)
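A hedged sketch of this kind of log mining: with xroot 3.3.X logging only to one-second resolution (the timestamp format here is an assumption), sub-second delays are invisible, so the best one can do is flag large gaps between consecutive entries.

    import re
    from datetime import datetime

    # "yymmdd hh:mm:ss" log prefix (format assumed)
    TS = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2})")

    def find_gaps(lines, threshold=5):
        # Yield (gap_seconds, line) wherever consecutive timestamped
        # log entries are more than `threshold` seconds apart.
        prev = None
        for line in lines:
            m = TS.match(line)
            if not m:
                continue
            t = datetime.strptime(m.group(1), "%y%m%d %H:%M:%S")
            if prev and (t - prev).total_seconds() > threshold:
                yield (t - prev).total_seconds(), line.rstrip()
            prev = t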
Solution – RAL Infrastructure
[Diagram: revised RAL setup – core CASTOR services (nsd, xrd-mgr, TM, rhd, stagerd, TGW) and diskservers (x20) now running CASTOR 2.1.14-15 with XROOT 3.3.6-1; a dedicated nsd/xrd-mgr feeds the Xroot redirector (4.X), which serves local WNs directly and remote WNs via the EU and Global redirectors on the Grid]