LHCb report to the LHCC and C-RSG
Philippe Charpentier (CERN), on behalf of LHCb

Activities in 2009-Q3/Q4

- Core software
  - Stable versions of Gaudi and LCG-AA
- Applications
  - Stable as of September for real data
  - Fast minor releases to cope with the reality of life
- Monte Carlo
  - Intensive MC09 simulation (5 TeV)
    - Minimum bias
    - b- and c-inclusive
    - b signal channels
  - A few events in the foreseen 2009 configuration (450 GeV)
  - MC09 stripping (2 passes)
    - Trigger stripping
    - Physics stripping
- Real data reconstruction and stripping
  - As of November 20th

Resource usage

- 139 sites hit, 4.2 million jobs
- Start in June: start of MC09
- Job failure rate: 15% (17% at Tier1s)

(Plot slides: failure breakdown; production and user jobs; jobs at Tier1s; job types at Tier1s; CPU used, not normalised.)

- Average job duration
  - 5.6 hours for all jobs
  - 20 minutes for user jobs (20% of jobs)
  - 6.6 hours for production jobs

(Plot slide: CPU usage, not normalised.)

WLCG vs LHCb accounting (unnormalised)

- 13% more CPU in the WLCG accounting than in DIRAC (unnormalised)
  - 1.26 Mdays vs 1.1 Mdays
  - Overhead of non-reporting jobs plus the pilot/LCG/batch frameworks
  - (A small worked example of this comparison appears below, after the first-data slides.)
- Average CPU power: 1.5 kSI2k (from the WLCG accounting)

Normalised CPU usage in 2009

- Ramp-up of the pilot role in summer
- Resource usage decreased since the LHC restarted
  - Concentrate on the (few) real data
  - Wait for data analysis before continuing MC simulation

(Plot legend: group 1 production, group 2 pilot, groups 3 and 4 user, group 5 lcgadmin.)

Resource usage

  Site   | Used (kHS06.years) | Requested (kHS06.years)
  -------|--------------------|------------------------
  CERN   | …                  | …
  Tier1s | …                  | …
  Tier2s | …                  | …
  Total  | …                  | …

- Note: the CERN number above does not include non-Grid usage
  - From the WLCG accounting, 32% of the usage at CERN is non-Grid
  - The CERN number should then read 2.18 kHS06.years
- CPU usage within 10% of the requests
- Distribution not exactly as expected
  - More non-Tier1 resources were available, so less MC ran at CERN and the Tier1s
  - Almost no real data, so fewer resources were used at CERN (the CAF was not used as much as expected)

Storage usage

  Site         | Requested (TB) | Allocated (TB) | Used (TB)
  -------------|----------------|----------------|----------
  CERN TxD1 *) | …              | …              | …
  CERN T1D0 *) | …              | …              | irrelevant
  CERN **)     | …              | …              | …
  Tier1s **)   | …              | 1740 ***)      | …

  *) From Castor queries today
  **) From the WLCG accounting at the end of December
  ***) Including 420 TB of T1D0 cache

- Sites provided slightly more than the pledges
  - Thanks!
  - At CERN, some disk pools (default, T1D0) were not included in the requests but are in the accounting

Experience with real data

First experience with real data

- Very low crossing rate
  - Maximum 8 bunches colliding (88 kHz crossing rate)
  - Very low luminosity
  - Minimum bias trigger rate: from 0.1 to 10 Hz
  - Data taken with single beam and with collisions
- No zero-suppression in the VELO; otherwise only ~25 GB!
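The accounting comparison promised above: a minimal sketch reproducing the arithmetic of the "WLCG vs LHCb accounting" slide. Only the two CPU totals come from the slide; the kSI2k-to-HS06 conversion factor is an assumption (the commonly quoted WLCG figure), and this is illustrative Python, not LHCb or DIRAC code.

```python
# Minimal sketch (illustrative, not LHCb/DIRAC code): the WLCG-vs-DIRAC
# accounting comparison from the "WLCG vs LHCb accounting" slide.

wlcg_cpu_days = 1.26e6   # WLCG accounting, unnormalised CPU days
dirac_cpu_days = 1.10e6  # DIRAC accounting (payload jobs that reported)

# The difference is the overhead of jobs that never reported plus the
# pilot/LCG/batch frameworks wrapped around the payload.
overhead = (wlcg_cpu_days - dirac_cpu_days) / wlcg_cpu_days
print(f"overhead: {overhead:.1%}")  # ~12.7%, the ~13% quoted on the slide

# The slide quotes an average CPU power of 1.5 kSI2k; converting with the
# commonly used WLCG factor of ~4 HS06 per kSI2k (an assumption here):
avg_power_hs06 = 1.5 * 4
print(f"average CPU power: ~{avg_power_hs06:.0f} HS06 per job slot")
```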
Real data processing

- Iterative process
  - Small changes in the reconstruction application
  - Improved alignment
  - In total, 7 sets of processing conditions
    - Only the last files were all processed; 4 times by now (twice in 2010)
- Processing submission
  - Automatic job creation and submission after:
    - the file has been successfully migrated in Castor
    - the file has been successfully replicated at a Tier1
  - If a job fails for a reason other than an application crash:
    - the file is reset as "to be processed"
    - a new job is created and submitted (automatically)
  - Processing is more efficient at CERN (see later)
    - Eventually, after a few trials at a Tier1, the file is processed at CERN
  - No stripping ;-)
    - DST files are distributed to all Tier1s for analysis

(Plot slide: reconstruction jobs.)

Issues with real data

- Castor migration
  - Very low rate: the migration algorithm had to be changed for more frequent migration (every hour instead of every 8 hours)
- Issue with large files (above 2 GB)
  - Real data files are not ROOT files but are opened by ROOT
  - There was an issue with a compatibility library for slc4 32-bit on slc5 nodes
    - Fixed within a day
- Wrong magnetic field sign
  - Due to the different coordinate systems of LHCb and the LHC ;-)
  - Fixed within hours
- Data access problems (by protocol, directly from the server)
  - Still a dCache issue at IN2P3 and NIKHEF
    - dCache experts are working on it
  - Moved to the copy-mode paradigm for reconstruction
  - Still a problem for user jobs: a pain!
    - Sites are regularly banned for analysis

Transfers and job latency

- No problems observed during file transfers
  - Files randomly distributed to the Tier1s
  - Will move to distribution by runs (a few hundred files each)
  - For 2009, runs were never longer than 4-5 files!
  - Maximum file size set to 3 GB
- Very good Grid latency
  - Time between submission and the job starting to run

Resource requests

Resource requests for 2010-2012

- 2010 running
  - The requests were made in April-June 2009
    - No additional resources expected
    - Try to fit within those requests
  - Running scenario for LHCb (see the sketch after the requirements table below):
    - March: 35% LHC, 100 Hz
    - April-May-June: 50% LHC, 1 kHz on average
    - July-August-September-half of October: 2 kHz
    - No heavy-ion run for LHCb
    - This corresponds to … kHz
    - The request accounted, precisely by chance, for … seconds
    - Therefore we use … seconds for 2010 at a 2 kHz trigger rate
- 2011 running
  - Use the recommendation of the MB:
    - March: 35% LHC, 2 kHz
    - April to mid-October: 50% LHC, 2 kHz
    - Total running time: … seconds
- 2012: no run

Resource requirements for 2010-2012

(Table: CPU (kHS06.years, integrated and power), disk (TB) and tape (TB) for 2010 (old and confirmed), 2011 (preliminary) and 2012 (very preliminary), broken down into CERN T0, CERN CAF for analysis/calibration/alignment, CERN T0+CAF, Tier1s and Tier2s; the numbers did not survive extraction.)
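The sketch referenced in the 2010 running scenario above: how such a scenario integrates into a triggered-event count and an equivalent live time at the nominal 2 kHz. The period lengths and the LHC duty fraction of the third period are illustrative assumptions, since the corresponding figures on the slide did not survive extraction; this is illustrative Python, not the official LHCb planning numbers.

```python
# Minimal sketch (illustrative assumptions, not official LHCb figures):
# integrate a running scenario into a number of triggered events, then
# express it as an equivalent live time at the nominal 2 kHz rate.

SECONDS_PER_DAY = 86400

# (days in period, fraction of time LHC delivers physics, trigger rate in Hz)
scenario_2010 = [
    (31,  0.35, 100),   # March: 35% LHC, 100 Hz
    (91,  0.50, 1000),  # April-June: 50% LHC, 1 kHz on average
    (107, 0.50, 2000),  # July to mid-October: 2 kHz (50% duty assumed)
]

events = sum(days * SECONDS_PER_DAY * duty * rate
             for days, duty, rate in scenario_2010)

equivalent_seconds = events / 2000  # live seconds at the nominal 2 kHz
print(f"{events:.2e} events, i.e. {equivalent_seconds:.2e} s at 2 kHz")
```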
Comments on resources

- Very uncertain and fluctuating running plans!
- Depending on LHC running, the MC requests may be different
  - Minimum bias, charm physics, b physics
- Only after (at least) one year of experience can we see how running analysis on the Grid works
  - Analysis at CERN?
  - Analysis at Tier3s?
  - Reliability for analysis?
- 2012 is still very uncertain
  - No LHC running
  - Will the MC requests be the same as in previous years?
  - How many reprocessings?
    - Currently assume 1 full reprocessing of the 2010 data and 2 of the 2011 data

Conclusions

- Real data in 2009
  - So few that they didn't impact resource usage
  - Extremely valuable for:
    - setting up procedures
    - starting to understand the detector
      - already very promising performance after a few days (π0 peak and K0 reconstruction)
    - exercising the automatic processes
- 2010
  - Still expect somewhat chaotic running
    - Frequent changes in LHC settings, LHCb trigger commissioning
  - No change in the LHCb resource requests w.r.t. June 2009
- 2011
  - More precise requests, with the experience from 2010
- 2012
  - Still very preliminary, but only a small increase compared to 2011