Analysis Operations Experience
2010 highlights, 2011 work list
Stefano Belforte, CMS week, Dec 7, 2010
Analysis Metrics
Overall status tracked weekly, with summary mails sent to the AnalysisOperations forum and with a series of graphs:
https://hypernews.cern.ch/HyperNews/CMS/get/analysisoperations/189.html
https://hypernews.cern.ch/HyperNews/CMS/get/analysisoperations/192.html
https://twiki.cern.ch/twiki/bin/view/CMS/AnalysisOpsMetricsPlots
Allows us to see the forest while watching the trees daily
Lots of analysis jobs
More analysis than production jobs, but (not much) shorter
Load on the infrastructure scales with #jobs
Until Crab3 (at least), job duration can only be improved via user education (see the sketch below)
Education campaigns need tools and effort
Not much room left for more jobs
#jobs running at the same time (i.e. used slots) is showing saturation
Job wrapper time (avg. last week): Analysis 2h, Production 5h, JobRobot 20min
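A back-of-envelope check of why load scales with #jobs and why shorter jobs matter: for a fixed pool of slots, throughput is slots times 24h divided by the average wrapper time. A minimal sketch using the wrapper times above; the slot count is an illustrative assumption, not a CMS figure:

```python
# Back-of-envelope throughput for a fixed pool of slots.
# SLOTS is an illustrative assumption, not a CMS number.
SLOTS = 10_000  # hypothetical count of concurrently usable analysis slots

def jobs_per_day(avg_hours: float, slots: int = SLOTS) -> float:
    """Jobs/day a saturated pool can absorb: slots * 24h / avg wrapper time."""
    return slots * 24.0 / avg_hours

for kind, hours in [("Analysis", 2.0), ("Production", 5.0), ("JobRobot", 1.0 / 3)]:
    print(f"{kind:10s} {hours:5.2f} h/job -> {jobs_per_day(hours):>9,.0f} jobs/day")
# Halving the average analysis job duration doubles the number of jobs the
# same slots can serve -- hence the interest in user education until Crab3.
```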
Too many jobs?
It is a success of CMS that so many users manage to run so much data processing
But expect more frustration as waiting time grows
May also expose us to new failure modes (like any new regime so far)
Too much data?
Space centrally managed by Analysis Operations (2.7 PB) is ~full
[Plot: central space usage as of Dec 5, 2010, 19:45, during the deletion campaign]
AnOps just cleaned up in preparation for the 39-based reprocessing
AnOps can no longer host all commonly used versions of MC and data:
MC for two energies (7 TeV and 8 TeV(?)), with and w/o pile-up, in two releases
data for two releases
Unlikely that all of this will fit into the central space unless we switch to AOD-only distribution (see the sketch below)
Expect physics groups to use more of their space in 2011
[Plot: 2029 TB used by AnOps; 1.5 of 3.75 PB free at official group sites]
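The combinatorics behind "can no longer host everything" can be made concrete. Only the 2.7 PB total and the 2029 TB used come from the slides; the per-copy sizes below are pure placeholders chosen to illustrate the squeeze:

```python
from itertools import product

CENTRAL_TB = 2700   # ~2.7 PB centrally managed by AnOps (from slide)
USED_TB = 2029      # used as of Dec 5, 2010 (from slide)

MC_COPY_TB = 250    # placeholder size of one MC combination -- assumption
DATA_COPY_TB = 400  # placeholder size of one data release copy -- assumption

# two energies x with/without pile-up x two releases for MC
mc = list(product(["7TeV", "8TeV"], ["no-PU", "PU"], ["rel-A", "rel-B"]))
data = ["rel-A", "rel-B"]  # data for two releases

needed = len(mc) * MC_COPY_TB + len(data) * DATA_COPY_TB
print(f"{len(mc)} MC variants + {len(data)} data copies -> {needed} TB wanted,"
      f" vs {CENTRAL_TB} TB total ({CENTRAL_TB - USED_TB} TB currently free)")
```

With these (hypothetical) sizes, 8 MC variants plus 2 data copies already exceed the whole central allocation, which is the point of the slide.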
Caring for users' data: placement
In 2010: 17 PB transferred to T2s, 4.5 PB transferred from T2s
Central space completely refreshed about every 3 months (see the sketch below)
Data distribution works quite well
[Plots: transfer volume to T2s and from T2s]
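A quick consistency check on these numbers: refreshing a 2.7 PB area roughly every 3 months implies order 11 PB/year of churn, the same scale as the 17 PB moved to T2s in 2010 (which also includes placements beyond the central area). A minimal sketch:

```python
CENTRAL_PB = 2.7    # centrally managed space (from slide)
REFRESH_MONTHS = 3  # "completely refreshed about every 3 months"

churn_pb_per_year = CENTRAL_PB * 12 / REFRESH_MONTHS
print(f"Implied churn: ~{churn_pb_per_year:.0f} PB/year "
      f"(compare: 17 PB transferred to T2s in 2010)")
```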
Caring for users' data: the StoreResults service
Elevates datasets from local to global DBS and PhEDEx (sketched below)
Two instances deployed: RWTH Aachen and Imperial College London
Three part-time operators: Th. Kress, M. Giffels and M. Cutajar
StoreResults monitoring developed in 2010:
Access restricted to group representatives for performance reasons
Allows physics groups to monitor their requests in real time
Very useful for the operators too; for example, StdOut and StdErr of retrieved jobs are directly available
On average 1-2 requests per day, usually in bunches of 2-10
340 requests in 2010, ~98% elevated successfully
Average elevation time 46 hours; however, 3000-job requests from SUSY take their time
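For readers who have not met the term, "elevation" promotes a user dataset known only to a local DBS instance into the global catalogues so PhEDEx can transfer it. A minimal sketch of that flow; every function here is a hypothetical placeholder, not the real StoreResults API:

```python
# Hypothetical sketch of the StoreResults "elevation" flow -- all function
# names and bodies are placeholders, NOT the real StoreResults code.

def read_from_local_dbs(dataset: str) -> list[str]:
    """Placeholder: look up the dataset's file blocks in the local DBS."""
    return [f"{dataset}#block{i}" for i in range(3)]

def register_in_global_dbs(dataset: str, blocks: list[str]) -> None:
    """Placeholder: re-register those blocks in the global DBS."""
    print(f"global DBS: registered {len(blocks)} blocks of {dataset}")

def inject_into_phedex(dataset: str) -> None:
    """Placeholder: inject into PhEDEx, making the data transferable."""
    print(f"PhEDEx: {dataset} injected, T2-T2 subscriptions now possible")

def elevate(dataset: str) -> None:
    blocks = read_from_local_dbs(dataset)
    register_in_global_dbs(dataset, blocks)
    inject_into_phedex(dataset)

elevate("/MyPhysics/UserSkim-v1/USER")  # hypothetical dataset name
```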
StoreResults Future Development
Re-implement StoreResults in WMAgent; support growing datasets natively
RequestManager will be used for data-processing requests: it combines making requests, approval by authorized people and tracking of the request in a single place, and replaces the current Savannah interface
Development schedule and support will be discussed with Computing/Offline Management this week
Caring for users' data: transfer
Users need to move data from one T2 to another:
to offer more routes/bandwidth for data replication
to replicate data elevated via StoreResults
to collect CRAB outputs staged out at job execution
Requires a staging area /store/temp/user at each site and transfer links among all T2s (full mesh, see the sketch below)
Consistent deployment on a large distributed system: a simple, well-defined goal, but a persistent, patient, months-long effort
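"Full mesh among all T2s" grows quadratically with the number of sites, which is what made commissioning a months-long effort: with the ~48 T2s quoted on the next slide, and treating each direction of a pair separately, the count of directed links is N(N-1). A minimal sketch:

```python
N_T2 = 48  # number of T2 sites, figure quoted on the next slide

directed_links = N_T2 * (N_T2 - 1)  # each ordered pair is its own link
print(f"{N_T2} T2s -> {directed_links} directed links to commission")
# 48 T2s -> 2256 links, each potentially needing its own round of
# debugging site-specific storage and stage-out quirks.
```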
Commissioning data transfer infrastructure
Local CRAB stage-out validated at 41/48 sites
3 of the 7 missing sites fail due to a known Crab shortcoming
Caring for users' jobs: Crab feedback
Support more than 400 different users every week
Cater to everything from data lookup to grid and site problems
Draw a line at "cmsRun" (could be easier with better reporting; often users do not know how to read cmsRun exit codes, see the sketch below)
Not possible to help without extensive experience and knowledge of CRAB, grid and facilities internals; several attempts at bringing in new personnel failed with "I have no idea how to help here"
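On users not knowing how to read cmsRun exit codes: the sort of first-pass triage a supporter would love to receive already done. The log layout and the "ExitCode" pattern below are illustrative assumptions, not the actual CRAB or cmsRun output format:

```python
import re
import sys

# First-pass triage sketch -- the "ExitCode" pattern and the log layout
# are assumptions for the example, not the real CRAB/cmsRun file format.
EXIT_RE = re.compile(r"ExitCode\s*[=:]\s*(\d+)")

def find_exit_codes(log_path: str) -> list[int]:
    """Collect every exit code mentioned in a retrieved job's log."""
    with open(log_path) as fh:
        return [int(m.group(1)) for m in EXIT_RE.finditer(fh.read())]

if __name__ == "__main__":
    codes = find_exit_codes(sys.argv[1])  # e.g. the stdout of one job
    bad = [c for c in codes if c != 0]
    print("all OK" if not bad else f"failed with exit code(s): {bad}")
```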
[Plot: mail volume (#messages) handled by Analysis Operations on the CrabFeedback forum in 2010]
Caring for users' jobs: Crab Server
Running 6 now: 2 at CERN, 2 at UCSD, 1 each at Bari and DESY
By operational choice, one is in drain at any time to allow a DB reset
Other issues continuously pop up (from hw to sw), so we usually have 4 or fewer
Good side:
Installed, operated and by and large debugged by non-developers
Much improved after last summer's developer effort
More than 50% of the analysis jobs run through Crab Servers
Most of the time: they simply work
Bad side:
Operationally heavy
Job/task tracking can fail and users need to resubmit
Resubmission awkward at times
Status reporting obscure at times, leading to unnecessary help requests
Give us a hand! How to be a good user
Read and use documentation
Give all information in the first message, before we need to ask: crab.cfg, crab.log, relevant stdout/err, dashboard URL ..
Do not blindly use config files you do not understand
Make some effort to figure out the problem yourself (in the end it will save you time)
Do not expect solution in “minutes”, be prepared for “tomorrow”
Get trained before you need to do something in a rush; ramp up in steps: 1 job, 10 jobs, 100 jobs, … (see the sketch below)
If you see time-critical CRAB work ahead: get hands-on experience now
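What "ramp up in steps" looks like in practice: scale the job count of a small test before the real submission. The sketch below writes a minimal CRAB2-style crab.cfg from memory; treat every section, key and value as an illustrative assumption (dataset, pset and counts are placeholders) and check the CRAB documentation before relying on any of it:

```python
# Writes a minimal CRAB2-style crab.cfg. All keys and values are sketched
# from memory and are illustrative assumptions -- verify against the docs.
import configparser

cfg = configparser.ConfigParser()
cfg["CRAB"] = {"jobtype": "cmssw", "scheduler": "glite"}  # example values
cfg["CMSSW"] = {
    "datasetpath": "/SomeDataset/SomeEra-v1/AOD",  # placeholder dataset
    "pset": "my_analysis_cfg.py",                  # placeholder CMSSW config
    "total_number_of_events": "1000",              # start small, then grow:
    "number_of_jobs": "1",                         # 1 job, then 10, then 100
}
cfg["USER"] = {"return_data": "0", "copy_data": "1"}  # stage out, not sandbox

with open("crab.cfg", "w") as fh:
    cfg.write(fh)
```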
The good (2010 highlights)
Users do a lot of analysis work
We believe we know what’s going on
Grid works, at least as well as our tools
We never have something as easy and simple as "service X at site Y has been broken for a day and nobody noticed"
We have replaced developers in Crab Server daily ops and Crab feedback daily support, with no service deterioration for the user community
We successfully care for a large amount of data
The bad (current concerns)
We are reaching the point of being resource constrained
Many decisions were made easy by abundance:
do not debug: resubmit
do not fix the site: replicate the data
Doing better requires thinking, planning, daily follow-up and better tools, i.e. human time
Crab2 is in maintenance mode, while still operationally heavy
Working in firefighting / complaint-driven mode much more than desired (and expected)
As feared, transient random problems in the infrastructure are the main effort drain
More and more difficult to find manpower willing to work here
Maybe the collaboration was expecting effort here to be ramping down by now?
The ugly (more work in 2011)
Major technology transition looming ahead (Crab3): can we spare operation resources for the transition?
More new things coming down the pipeline: DBS3, something new in monitoring, multicore processing, new data access methods/patterns
The distributed system stays distributed, but dependencies among sites grow as the WAN boundary gets fainter
A very fast and very performant connection does not take away having to deal with two different systems
Nothing is as easy and basic as copying one file from here to there, yet figuring out why data does not move in PhEDEx is still something only a handful of people can do
The year that comes
Things we need to do in 2011 (and following) for a brighter future, besides all we already do and need to keep doing:
Work on job monitoring and error reporting: not yet good enough to tell exactly where resources are wasted and how to fix it
Work on Crab Server monitoring: the current one is not good enough to avoid complaint-driven mode or to spot systemic problems in the server or middleware
Work on user problem reporting: reduce the number of mail iterations
Learn how to use sites of very different size and reliability, since we may need all of them
Work on better alignment of FacOps site testing with AnOps reality (a continuous effort)
Summary
Analysis metrics
High-level overview: under control
Error/efficiency reporting: work to do
Data placement and transfer for physics users
Centrally managed storage area: OK but filling up
T2-T2 transfer links: a success story
Group data distribution: OK but need to transition to a new tool
CRAB operations
Crab feedback: unsustainable for a year now, but coping
Crab Server: works, but operationally too heavy
Local stage-out: storage commissioned, tool still to be validated
Concerns
End of the abundant-resources era: more work needed
New tools coming: more work needed
Things to work on in 2011: a pretty long list, given we should be in steady operations