Distributed Analysis
K. Harrison
LHCb Collaboration Week, CERN, 1 June 2006
1 June 2006 2/20
Aims of distributed analysis
• Physicist defines job to analyse (large) dataset(s)
• Use distributed resources (computing Grid)
[Diagram: a single submitted job has its workload distributed over Subjob 1, Subjob 2, Subjob 3, ..., Subjob n, and the combined output is returned]

LHCb distributed-analysis system based on LCG (Grid infrastructure), DIRAC (workload management) and Ganga (user interface)
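The split/run/combine cycle in the diagram can be sketched in a few lines of Python; the names (`split_dataset`, `run_subjob`, `combine`) are illustrative, not part of DIRAC or Ganga:

```python
def split_dataset(files, files_per_job):
    """Divide the input file list into chunks, one per subjob."""
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]

def run_subjob(chunk):
    """Stand-in for analysing one chunk of files on a Grid worker node."""
    return {"events": 500 * len(chunk)}  # pretend each file holds 500 events

def combine(outputs):
    """Merge the per-subjob outputs into the single result the user sees."""
    return {"events": sum(o["events"] for o in outputs)}

files = ["data_%03d.dst" % i for i in range(10)]
subjobs = split_dataset(files, files_per_job=3)   # 4 subjobs: 3+3+3+1 files
result = combine(run_subjob(c) for c in subjobs)  # -> {'events': 5000}
```

The user only ever sees the single job and the combined output; the splitting and merging happen behind the scenes.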
LHCb computing model

[Diagram: Tier-1 centres and Tier-2 centres]

• Baseline solution: analysis at Tier-1 centres
• Analysis at Tier-2 centres not in baseline solution, but not ruled out
DIRAC submission to LCG: Pilot Agents

[Diagram: DIRAC components (Job Receiver, Data Optimiser, LFC, Job DB, Task Queue, Matcher, Agent Director, Agent Monitor) submitting Pilot Agents through the LCG WMS to computing resources]

• Data Optimiser queries the Logical File Catalogue (LFC) to identify sites for job execution
• Agent Director submits Pilot Agents for jobs in the waiting state
• Agent Monitor tracks Agent status, and triggers further submission as needed
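The pull model described above can be illustrated with a toy in Python (all names are illustrative, not DIRAC's actual classes): the Agent Director submits one pilot per waiting job, and each pilot, once it lands on a worker node, pulls a matching job from the Task Queue:

```python
from collections import deque

task_queue = deque(["job-1", "job-2", "job-3"])   # jobs in "waiting" state

def submit_pilots(n):
    """Agent Director: launch n Pilot Agents (stand-ins for LCG submissions)."""
    return ["pilot-%d" % i for i in range(n)]

def pilot_run(queue):
    """A pilot checks its worker node, then pulls the next waiting job;
    if nothing matches it simply exits, so no job is tied to a bad slot."""
    return queue.popleft() if queue else None

pilots = submit_pilots(len(task_queue))
executed = [pilot_run(task_queue) for _ in pilots]  # each pilot runs one job
```

The key benefit of the pattern: a job is only matched to a resource after a pilot has successfully started there, which shields users from many Grid-level failures.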
DIRAC submission to LCG: Bond Analogy

[Same workflow diagram as the previous slide (Job Receiver, Data Optimiser, LFC, Matcher, Job DB, Task Queue, Agent Director, Agent Monitor, Pilot Agents, LCG WMS, computing resources), presented as a Bond analogy]

• Data Optimiser queries the Logical File Catalogue to identify sites for job execution
• Agent Director submits Pilot Agents for jobs in the waiting state
• Agent Monitor tracks Agent status, and triggers further submission as needed
Ganga job abstraction
• A job in Ganga is constructed from a set of building blocks, not all required for every job

[Diagram: Job building blocks]
  – Application: what to run
  – Backend: where to run
  – Input Dataset: data read by application
  – Output Dataset: data written by application
  – Splitter: rule for dividing into subjobs
  – Merger: rule for combining outputs
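The building blocks above can be summarised as a simple record type. This is a sketch of the structure only, not the real Ganga `Job` class:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    application: str                        # what to run
    backend: str                            # where to run
    input_dataset: List[str] = field(default_factory=list)   # data read
    output_dataset: List[str] = field(default_factory=list)  # data written
    splitter: Optional[str] = None          # rule for dividing into subjobs
    merger: Optional[str] = None            # rule for combining outputs

# A minimal job needs only an application and a backend;
# splitting and merging are optional extras:
j = Job(application="DaVinci", backend="Dirac")
```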
Framework for plugin handling
• Ganga provides a framework for handling different types of Application, Backend, Dataset, Splitter and Merger, implemented as plugin classes
• Each plugin class has its own schema

[Diagram: GangaObject at the top, with plugin interfaces IApplication, IBackend, IDataset, ISplitter and IMerger below]

Example plugins and schemas:
  – DaVinci (IApplication), user-set fields: version, cmt_user_path, masterpackage, optsfile, extraopts
  – Dirac (IBackend), user-set field: CPUTime; system-set fields: destination, id, status
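The interface-plus-schema idea can be sketched like this (a toy, not the actual Ganga plugin framework; the field names are taken from the slide):

```python
class IApplication:
    """Plugin interface: every application plugin declares its own schema."""
    schema = {}

class DaVinci(IApplication):
    # User-set fields, as listed on the slide:
    schema = {"version": None, "cmt_user_path": None,
              "masterpackage": None, "optsfile": None, "extraopts": None}

class IBackend:
    """Plugin interface for backends (where the job runs)."""
    schema = {}

class Dirac(IBackend):
    # CPUTime is user-set; destination, id and status are system-managed.
    schema = {"CPUTime": None, "destination": None,
              "id": None, "status": None}
```

Because every plugin satisfies the same interface, the framework can mix and match any application with any backend.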
Ganga Command-Line Interface in Python (CLIP)
• CLIP provides interactive job definition and submission from an enhanced Python shell (IPython)
  – Especially good for trying things out, and understanding how the system works

  # List the available application plug-ins
  list_plugins("application")
  # Create a job for submitting DaVinci to DIRAC
  j = Job(application="DaVinci", backend="Dirac")
  # Set the job-options file
  j.application.optsfile = "myOpts.txt"
  # Submit the job
  j.submit()
  # Search for string in job's standard output
  !grep "Selected events" $j.outputdir/stdout
Ganga scripting
• From the command line, a script myScript.py can be executed in the Ganga environment using: ganga myScript.py
  – Allows automation of repetitive tasks
• Scripts for basic tasks included in distribution:

  # Create a job for submitting Gauss to DIRAC
  ganga make_job Gauss DIRAC test.py
  # Edit test.py to set Gauss properties, then submit job
  ganga submit test.py
  # Query status, triggering output retrieval if job is completed
  ganga query

• Approach similar to the one typically used when submitting to a local batch system
Ganga Graphical User Interface (GUI)
• GUI consists of central monitoring panel and dockable windows
• Job definition based on mouse selections and field completion
• Highly configurable: choose what to display and how

[Screenshot: GUI showing Job builder, Job details, Logical Folders, Scriptor, Job Monitoring panel and Log window]
Shocking News!
• LHCb Distributed Analysis system is working well
• DIRAC and Ganga providing complementary functionality
• People with little or no knowledge of Grid technicalities are using the system for physics analysis
• More than 75 million events processed in past three months
• Fraction of jobs completing successfully averaging about 92%
• Extended periods with success rate >95%

"How can this be happening?" "Did he say 75 million?" "Who's doing this?"
Beginnings of a success story
• 2nd LHCb-UK Software Course held at Cambridge, 10th-12th January 2006
• Half day dedicated to Distributed Computing: presentations and 2 hours of practical sessions
  – U.Egede: Distributed Computing & Ganga
  – R.Nandakumar: UK Tier-1 Centre
  – S.Paterson: DIRAC
  – K.Harrison: Grid submission made simple
• Made clear to participants a number of things:
  – Tier-1 centres have a lot of resources
  – Easy to submit jobs to Grid using Ganga
  – DIRAC ensures high success rate
  – Distributed analysis not just possible in theory but possible in practice

Photographs by P.Koppenburg
Cambridge pioneers of distributed analysis
• C.Lazzeroni: B+ → D0(KS0 π+π-)K+
• J.Storey: Flavour tagging with protons
• Project students:
  – M.Dobrowolski: B+ → D0(KS0 K+K-)K+
  – S.Kelly: B0 → D+D- and Bs0 → Ds+Ds-
  – B.Lum: B0 → D0(KS0 π+π-)K*0
  – R.Dixon del Tufo: Bs0
  – A.Willans: B0 → K*0+-
• R.Dixon del Tufo had previous experience of Grid, Ganga and HEP software
• Others encountered these for first time at LHCb-UK software course

"Cristina decided she preferred Cobra to Python"
Photograph by A.Buckley
CHEP06, Mumbai
Work model (1)
• Usual strategy has been to develop/test/tune algorithms using signal samples and small background samples on local disks, then process (many times) larger samples (>700k events) on Grid
• Used pre-GUI version of Ganga, with job submission performed using Ganga scripting interface
  – Users need only look at the few lines for specifying DaVinci version, master package, job options and splitting requirements
  – Splitting parameters are files per job and maximum total number of files (very useful for testing on a few files)
  – Script-based approach popular with both new users (very little to remember) and experienced users (similar to what they usually do to submit to a batch system)
  – Jobs submitted to both DIRAC and local batch system (Condor)
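The two splitting parameters mentioned above (files per subjob, and a cap on the total number of files) behave roughly like this; the function name is illustrative, not Ganga's own:

```python
def split_files(files, files_per_job, max_files=None):
    """Cap the total number of files, then chunk into one list per subjob."""
    if max_files is not None:
        files = files[:max_files]
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]

all_files = ["bkg_%04d.dst" % i for i in range(700)]
test_run = split_files(all_files, files_per_job=2, max_files=4)  # quick test
full_run = split_files(all_files, files_per_job=35)              # full sample
```

Setting a small `max_files` first lets a user debug the job options on a couple of subjobs before launching over the full sample.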
Work model (2)
• Interactive Ganga session started to have status updates and output retrieval
• DIRAC monitoring page also used for checking job progress
• Jobs usually split so that output files were small enough to be returned in sandbox (i.e. retrieved automatically by Ganga)
• Large outputs placed on CERN storage element (CASTOR) by DIRAC
  – Outputs retrieved manually using LCG transfer command (lcg-cp) and logical-file name given by DIRAC
• Hbook files merged in Ganga framework using GPI script:
  – ganga merge 16,27,32-101 myAnalysis.hbook
• ROOT files merged using standalone ROOT script (from C.Jones)
• Excellent support from S.Paterson and A.Tsaregorodtsev for DIRAC problems/queries, and from M.Bargiotti for LCG catalogue problems
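The merge command above takes a job-id list such as 16,27,32-101. A minimal parser for that range syntax (illustrative, not Ganga's own code) is:

```python
def parse_id_spec(spec):
    """Expand a spec like '16,27,32-101' into the full list of job ids."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            ids.extend(range(lo, hi + 1))
        else:
            ids.append(int(part))
    return ids

ids = parse_id_spec("16,27,32-101")   # 2 single ids plus the 70 ids 32..101
```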
Example plots from jobs run on distributed-analysis system

• J.Storey: Flavour tagging with protons
  – Analysis run on 100k Bs → J/ψ φ tagHLT events
• C.Lazzeroni: Evaluation of background for B+ → D0(K0 π+π-)K+
  – Analysis run on 400k B+ → D0(K0 π+π-)K*0 events

Results presented at CP Measurements WG meeting, 16 March 2006
Project reports
• R.Dixon del Tufo: Bs0
• M.Dobrowolski: B+ → D0(KS0 K+K-)K+
• B.Lum: B0 → D0(KS0 π+π-)K*0
• A.Willans: B0 → K*0+-
• S.Kelly: B0 → D+D- and Bs0 → Ds+Ds-

• Reports make extensive use of results obtained using distributed-analysis system, especially for background estimates
• Aim to have all reports turned into LHCb notes
Job statistics (1)

  DIRAC job state:  outputready  stalled  failed  other  all
  Number of jobs:   5036         127      257     68     5488

• Statistics taken from DIRAC monitoring page for analysis jobs submitted from Cambridge (user ids: cristina, deltufo, kelly, lum, martad, storey, willans) between 20 February 2006 (week after CHEP06) and 15 May 2006
• Estimated success rate: outputready/all = 5036/5488 = 92%
• Individual job typically processes 20 to 40 files of 500-1000 events each
  – Estimated number of events successfully processed: 30 × 500 × 5036 ≈ 7.55 × 10^7
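The arithmetic behind these estimates can be checked directly:

```python
# Job counts quoted on this slide
output_ready, stalled, failed, other = 5036, 127, 257, 68
total = output_ready + stalled + failed + other   # 5488 jobs in all

success_rate = output_ready / total               # about 0.92, i.e. 92%

# Typical job: ~30 files of ~500 events each
events = 30 * 500 * output_ready                  # 75.54 million events
```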
Job statistics (2)
• Stalled jobs: 127/5488 = 2.3%
  – Proxy expires before job completes
    • Problem essentially eliminated by having Ganga create proxy with long lifetime
  – Problems accessing data?
• Failed jobs: 257/5488 = 4.7%
  – 73 failures where input data listed in bookkeeping database (and physically at CERN), but not in LCG file catalogue
    • Files registered by M.Bargiotti, then jobs ran successfully
  – 115 failures 7-20 April because of transient problem with DIRAC installation of software (associated with upgrade to v2r10)
• Excluding above failures, job success rate is: 5036/5300 = 95%
Conclusions
• LHCb distributed-analysis system is being successfully used for physics studies
• Ganga makes the system easy to use
• DIRAC ensures system has high efficiency
• Extended periods with job success rate >95%
• More than 75 million events processed in past three months
• Working on improvements, but this is already a useful tool
• To get started using the system, see user documentation on Ganga web site: http://cern.ch/ganga
"He did say 75 million!"