ALICE: Offline Planning and personnel resources



  • ALICE: Offline Planning and personnel resources

    LHCC Manpower Review of Computing, September 3, 2003

    ALICE : planning & resources

  • Questions to be answered

    Profile of available and required manpower at CERN / Regional Centers / Institutes

    Other resources existing and potential

    Computing elements which will not be provided in case the required manpower and resources are not available

    Measures of progress in producing necessary software

    Management tools to track the progress

    Verification of the quality of the LCG software

  • Foreword

    Lack of personnel in LHC computing (experiment & common HW/SW infrastructure) was emphasized by the LHC Computing Review (2001) and judged extremely worrying

    CERN and the Collaborations together must do all that they can to provide the HR that are needed for Core Software development

    The shortage has been alleviated for the LCG project by influx of computing professionals funded by member countries

    No such mechanism exists yet for the experiments, where the personnel shortage remains a problem

    ALICE has re-profiled the planning

    The data to be shown represent a bare minimum below which the readiness for data processing cannot be guaranteed.


  • Menu: Planning & Resources

    ALICE Offline organization & management

    Strategy for the Offline project, DC & milestones

    Personnel resources: available and requests

    Answers to questions & conclusions

  • Organization

    Offline project mandate:

      Prepare software and computing infrastructure for experiment data processing (+DAQ, +HLT projects);
      Provide and maintain a complete infrastructure for simulation, reconstruction and analysis already during the construction phase.

    Offline personnel for software developments:

      Core Offline project: minority, full time, located at CERN;
      Detector projects: most of the personnel, part time (preparation of apparatus), located in collaboration institutes;
      LCG provides common hardware and software infrastructure for LHC computing.

    Strict coordination required to make the best use of the personnel available.

  • Organization: management structure

    [Organigram. Recoverable box labels: Project Leader & Deputy; Resources Coordination; Planning Coordination; Production Environment Coordination; Framework & Infrastructure Coordination; Simulation Coordination; Reconstruction & Physics Coordination; Core Offline; Offline Board; US Grid Coordination; EU Grid Coordination; Software projects; Detector projects; DAQ; HLT; Int. Comp. Board; Regional Tiers; LCG SC2, GDB, POB]

  • Core Offline Work Packages

    Framework and infrastructure coordination

    Simulation coordination

    Reconstruction and physics coordination

    Production environment coordination

  • Organization: management structure

    Lightweight, single structure

    Efficient use of available personnel

    High adaptability to rapidly changing technology

    Merge framework developers (service providers) & physics algorithm developers (consumers)

    Maximize communication

    Economy of personnel (polymorphism of software experts)

    Rapid feedback to user requirements

  • Planning

    Strategy

      Dynamic management of the work schedule
      Develop a long-term software infrastructure
      Maintain the infrastructure in working state during detector construction

    Constraints

      Depend on the planning of external projects (LCG, EDG, EGEE)
      Most developers refer to detector projects
      Take advantage of the latest developments in fast-evolving technology
      No personnel available for in-depth planning activity
      Majority of personnel in the Core Offline project is temporary and with unpredictable skills

    Lightweight and opportunistic strategy with flexible data challenges as high-level milestones

  • Core team @ CERN

    A choice, not a necessity

    Need for a strong and centralized team of experts

    To facilitate coordination in all detector projects and all regional centers

    CERN, more than other ALICE groups, has the critical mass of people with the right skills

    Benefit from co-habitation with ALICE management

    And with LCG management

    Benefit from the attraction CERN exerts on young people with the right profile

  • Development strategy

    Minimize the effective amount of development

      Choose mature and well-tested products
      ROOT: common HEP solution for data persistency at the file level, interface to various libraries, visualization, graphical user interface, virtual Monte-Carlo, geometrical modeler
      AliEn: the ALICE distributed computing environment, all made with Open Source components based on Open Standards; 2 FTE for development, 0.5 for operation, in production since 2002

    Reduce staff and rely on temporary personnel

      However there is a threshold for staff

    Delegate well-identified and modular packages to teams outside the Core group

      Detector data base
      EDG/EGEE test bed

  • Data Challenges

    Stress-test the ALICE data model, DAQ hardware and software infrastructure with prototypes of increasing complexity until the 2007 objectives are reached.

    Computing DC: record HI data at 1.2 GBytes/s and export quasi-online processing outside CERN

    Physics DC: provide the infrastructure for organized Monte-Carlo production and world-wide random data analysis

  • Computing Data Challenge

    ALICE & IT:

      Assess the MS requirements and evaluate available products (1998);
      Evaluate functions of the DAQ, Offline and HLT projects;
      Large-scale high-throughput distributed DC (4) to:

        Prototype the DAQ, Offline, HLT computing systems
        Verify their integration
        Assess technologies and computing models
        Test hardware and software components in a realistic environment
        Achieve an early integration of the overall computing infrastructure

  • Milestones

  • Physics Data Challenge

    Objectives: prototype and test the scalability of the components needed to simulate, reconstruct, and analyze data on distributed computing resources

    Three interlinked components: ROOT, AliRoot, AliEn

  • Milestones

    * Fraction of events simulated in one year of standard data taking

  • PDC-III resources estimate

    Simulation

      10^5 Pb-Pb + 10^7 p-p
      Distributed production, (partial) data replication at CERN

    Reconstruction and analysis

      Data source is CERN: 5×10^6 Pb-Pb + 10^7 p-p
      Reconstruction at CERN and outside depending on resource availability

    Resources (CPU and storage)

      2004 Q1: 1354 kSI2k and 165 TB
      2004 Q2: 1400 kSI2k and 301 TB

    Bandwidth

      Simulation in 2004 Q1: ~90 TB will be shipped to CERN in about 2 months
      ~10 days using 10% of the CERN bandwidth.
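
    The bandwidth figures on this slide can be turned into average transfer rates with simple arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes), a constant transfer rate, and takes "about 2 months" as 60 days; the function name is hypothetical and not part of any ALICE tool.

```python
# Average sustained rate needed to move a data volume in a given time.
# Assumptions (not from the slides): decimal units, constant rate,
# "about 2 months" taken as 60 days.

SECONDS_PER_DAY = 86_400


def sustained_rate_mb_s(volume_tb: float, days: float) -> float:
    """Return the average rate in MB/s needed to move volume_tb in `days`."""
    return volume_tb * 1e12 / (days * SECONDS_PER_DAY) / 1e6


# ~90 TB shipped to CERN in about 2 months
rate_two_months = sustained_rate_mb_s(90, 60)
# the same ~90 TB moved in ~10 days
rate_ten_days = sustained_rate_mb_s(90, 10)

print(f"90 TB in 60 days: {rate_two_months:.1f} MB/s")
print(f"90 TB in 10 days: {rate_ten_days:.1f} MB/s")
```

    Under these assumptions the two scenarios correspond to roughly 17 MB/s and 104 MB/s on average.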


  • PDC-III resources profile


  • PDC-III resources

    Details in the ALICE Data Challenges paper, taking into account

      Results of previous PDC
      Estimation of simulations in a standard year (2009)

    Storage: 200 TB must be kept beyond the end of the PDC!

    The numbers indicating the LCG resources for ALICE assume simultaneous use of the resources by all the experiments!

    A dynamic resource allocation would easily solve the deficit

    USA quota to be confirmed

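    PDC-III requirements and LCG declared capacity for ALICE (spreadsheet columns run from 03Q1 to 04Q3; values are given for two quarters only, shown here as 04Q1 and 04Q2)

                                                     04Q1   04Q2
    CPU requirements (kSI2k)                         1354   1400
    LCG declared capacity for ALICE                   941    941
    Storage requirements, total active data (TB)      165    301
    LCG declared capacity for ALICE: disk             192    192
                                     tapes            578    578
                                     total            770    770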
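
    The "deficit" mentioned for PDC-III can be read off by subtracting the LCG declared capacity from the requirements. The sketch below uses the figures quoted on the slides; it assumes that the declared capacities are expressed in the same units as the requirements and apply to the same two quarters, neither of which is stated explicitly. The helper name is hypothetical.

```python
# Requirements and LCG declared capacity for ALICE, as quoted on the slides.
# Assumption: capacities use the same units and quarters as the requirements.
cpu_required = {"2004 Q1": 1354, "2004 Q2": 1400}    # kSI2k
cpu_declared = {"2004 Q1": 941, "2004 Q2": 941}
storage_required = {"2004 Q1": 165, "2004 Q2": 301}  # TB
storage_declared = {"2004 Q1": 770, "2004 Q2": 770}  # disk 192 + tapes 578


def deficit(required: dict, declared: dict) -> dict:
    """Positive values mean the requirement exceeds the declared capacity."""
    return {quarter: required[quarter] - declared[quarter] for quarter in required}


cpu_deficit = deficit(cpu_required, cpu_declared)
storage_deficit = deficit(storage_required, storage_declared)

for quarter in cpu_required:
    print(f"{quarter}: CPU {cpu_deficit[quarter]:+d} kSI2k, "
          f"storage {storage_deficit[quarter]:+d} TB")
```

    With these inputs the shortfall is in CPU only; the declared disk plus tape capacity exceeds the storage requirement in both quarters.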

  • Tracking progress

    Milestones set by the needs to prepare the Physics Performance Report

      Full and fast simulation
      Detector reconstruction
      Global reconstruction

    Progress monitored by Physics DC

    Central coordination at CERN (architect, librarian, multi-platform compatibility)

    Offline Board takes the decisions on framework evolution and reviews progress

    Developers implement during Offline week

    Code reviewed by experts

  • Verification of LCG software quality Grid technology area


  • Verification of LCG software quality Grid deployment area


  • Verification of LCG software quality Fabric area


  • ALICE Offline Planning

    [Planning chart; marker: Today]

  • Personnel Profile (task oriented)

    4 permanent staff persons

    Profile is built on the assumption that temporary personnel are NOT replaced*

    Evolution reported since 1998

    * Unrealistic scenario to emphasize fragility of the structure


  • Personnel Profile (task oriented) - 1/5

    Activity       98   99   00   01   02   03   04   05   06   07   08   09

    Off-line Coordination
      Avail.      0.8  1.0  1.0  1.0  1.0  1.7  1.5  1.0  1.0  1.0  1.0  1.0
      Needed      1.0  1.0  1.0  1.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0
      Missing     0.3  0.0  0.0  0.0  1.0  0.3  0.5  1.0  1.0  1.0  1.0  1.0

    DB and distributed computing infrastructure
      Avail.      0.6  2.2  1.6  1.5  1.8  2.0  2.0  2.4  2.0  0.8  0.0  0.0
      Needed      2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0
      Missing     1.5  0.2  0.4  0.5  0.3  0.0  0.0  0.0  0.0  1.2  2.0  2.0

    Framework Development
      Avail.      0.4  0.4  0.3  0.8  1.8  2.3  1.9  1.3  1.3  0.8  0.3  0.3
      Needed      1.0  1.0  1.5  1.5  1.5  2.0  2.0  2.0  2.0  2.0  2.0  2.0
      Missing     0.6  0.6  1.2  0.7  0.3  0.3  0.1  0.7  0.7  1.2  1.7  1.7

    Simulation framework
      Avail.      1.9  2.0  2.8  3.0  3.3  3.0  2.8  2.0  1.5  1.0  1.0  1.0
      Needed      3.0  3.0  3.0  3.0  3.0  3.0  3.0  2.0  2.0  1.5  1.0  1.0
      Missing     1.1  1.0  0.3  0.0  0.3  0.0  0.3  0.0  0.5  0.5  0.0  0.0

  • Personnel Profile (task oriented) - 2/5


  • Personnel Profile (task oriented) - 3/5

    Activity       98   99   00   01   02   03   04   05   06   07   08   09

    Radiation Studies
      Avail.      0.5  0.3  0.8  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
      Needed      0.5  0.5  1.0  1.0  1.0  1.0  1.0  1.0  0.5  0.5  0.5  0.5
      Missing     0.0  0.2  0.2  0.0  0.0  0.0  1.0  1.0  0.5  0.5  0.5  0.5

    System support
      Avail.      1.0  1.8  1.5  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
      Needed      1.0  1.0  1.5  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
      Missing     0.0  0.8  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

    Analysis support
      Avail.      0.0  0.0  0.3  1.0  1.2  1.4  0.8  0.0  0.0  0.0  0.0  0.0
      Needed      0.0  0.0  0.5  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
      Missing     0.0  0.0  0.2  0.0  0.2  0.4  0.2  1.0  1.0  1.0  1.0  1.0

  • Personnel Profile (task oriented) - 4/5 Summary Core Offline team

                   98   99   00   01   02   03   04   05   06   07   08   09
      Avail.      6.8  9.8 11.8 13.7 16.1 18.4 14.9 10.0  8.0  5.6  4.3  4.3
      Needed     11.5 11.5 15.7 16.5 17.5 18.0 18.5 17.5 17.0 16.5 16.0 16.0
      Missing     4.8  1.7  3.9  2.8  1.4  0.4  3.7  7.5  9.0 10.9 11.7 11.7
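
    The summary rows can be cross-checked by recomputing the gap between needed and available personnel for each year. The sketch below uses the values quoted on this slide; the variable names are illustrative only. It compares the magnitude of the recomputed gap with the quoted "Missing" row, since in 2003 the available figure (18.4) is slightly above the needed one (18.0) while 0.4 is still quoted as missing. One possible reading, which is an assumption here, is that the quoted row aggregates per-activity differences.

```python
# Core Offline team summary (FTE), 1998-2009, as quoted on the slide.
years = list(range(1998, 2010))
avail = [6.8, 9.8, 11.8, 13.7, 16.1, 18.4, 14.9, 10.0, 8.0, 5.6, 4.3, 4.3]
needed = [11.5, 11.5, 15.7, 16.5, 17.5, 18.0, 18.5, 17.5, 17.0, 16.5, 16.0, 16.0]
missing = [4.8, 1.7, 3.9, 2.8, 1.4, 0.4, 3.7, 7.5, 9.0, 10.9, 11.7, 11.7]

# Signed gap: positive means fewer people available than needed.
gap = [round(n - a, 1) for n, a in zip(needed, avail)]

# Largest disagreement between the gap magnitude and the quoted "Missing" row.
max_diff = max(abs(abs(g) - m) for g, m in zip(gap, missing))

for year, g, m in zip(years, gap, missing):
    print(f"{year}: needed - avail = {g:+.1f}, quoted missing = {m:.1f}")
print(f"largest difference: {max_diff:.1f} FTE")
```

    The recomputed magnitudes agree with the quoted row to within about 0.1 FTE, which is compatible with rounding in the original spreadsheet.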


  • Personnel Profile (task oriented) - 5/5

    Long build-up time

    Must sustain plateau after 2003

  • Personnel Profile (post oriented)

    4 permanent CERN staff

    Temporary CERN personnel (no replacement assumed*)

      Staff LD
      Technical and Physics students
      CERN Fellows

    Temporary CERN Project Associates (direct contribution from collaboration institutes + ALICE CERN exploitation budget; no replacement assumed*)

    * Unrealistic scenario to emphasize fragility of the structure


  • Personnel Profile (post oriented) - 1/5

    Mostly temporary personnel

    Substantial contribution from collaboration institutes

    ROOT effect in 1999, AliEn effect in 2003

  • Personnel Profile (post oriented) - 2/5

    Only 25% permanent personnel

    More than 60% are short/medium term personnel

  • Out-sourced projects - 1/3

    Detector DB by the Physics Department and Computer Science Department @ Warsaw University: a single DB (economy of personnel) common to all detectors in the experiment

  • Out-sourced projects - 2/3

    EDG testbed validation and participation in various GRID projects by ALICE/Italy, ALICE/US, and the EDG/DataTAG project; to be continued with EGEE

  • Out-sourced projects - 3/3

    AliEn: basis of the ALICE distributed computing infrastructure. Coordination and main development by the Core Offline group, but several specific sub-tasks are delegated to individuals at remote places

  • Resources summary

    Distribution of personnel for common offline activities

    About 40% of the work is distributed outside CERN

  • HLT Software

    Only personnel working on algorithms and simulation in collaboration with the Offline project

    Part of the missing personnel should come from PhD students

  • LCG projects in application area

    ALICE has already made most of the choices for critical issues (persistency, data DB, tracking, geometry descriptor, distributed computing, etc.)

    Does not need to rely on common LCG applications

    To come: AliEn coupled with PROOF as generic architecture for LCG interactive analysis

    However ALICE contributes to common developments:

      GANIS ????

  • Other resources

    EU project: one person to work full time on EDG for ALICE

    Industry : Do not remember who???? : Code checker

    Ericson : AliEn what exactly ????

    Nasa : one person full time on the Virtual Monte-Carlo ?????

  • Offline in detector projects - 1/3

    AliRoot: an object-oriented framework which directly uses ROOT and provides:

      Many event generators
      Tracking using Virtual Monte-Carlo
      IO infrastructure
      Steering functionalities
      Global reconstruction

      Detector (13) tracking and reconstruction
      Analysis

  • Offline in detector projects - 2/3

    No full-time dedicated developers

    Schedule defined by global milestones (DC)

    Planning is task oriented rather than personnel oriented

  • Offline in detector projects - 3/3

    Summary

      Total     39.7  37.3  35.8  35.8
      Needed     8.6  13.3  14.4  14.4

  • Personnel resources in Offline project

    About 16% of the personnel are at CERN, the remainder in collaboration institutes; no experiment-dedicated personnel at regional centers.

  • Personnel resources in Offline project

    OUTSIDE INSTITUTES (84%), CERN (16%)

    COLORS !

    Chart data:

      Analysis                 38
      Subdetector projects     51
      HLT                       4
      Core offline NOT CERN    10
      Core offline CERN        30

  • How to mitigate the lack of personnel

    The ALICE offline project is committed to providing the collaboration with adequate software to take and analyze data starting in 2007.

    The project has already adapted its strategy to the lack of personnel and aims at a bare minimum that still enables it to fulfill its tasks.

    The Core team cannot afford to lose more personnel without endangering the success of its goals.

    The severe lack of personnel in the detector projects will translate into a lack of readiness, both in the accuracy of the algorithms and in the availability of whole categories of algorithms. Such a deplorable situation will have a negative impact on the quality of physics results.

  • ALICE priorities - 1/4

    Core Offline group at CERN:

      Less than 1/4 of the personnel in the Core Offline group at CERN are permanent
      More than 50% are temporary personnel
      Dependence on availability of short-term CERN positions
      Uncertainty on renewals
      Loss of knowledge -- difficulty of knowledge transfer
      Difficulty to cover key positions with people with the appropriate profile
      Competition within ALICE in a fixed quota situation

  • ALICE priorities - 2/4

    Core Offline group at CERN:

      Have at least 1/3 of long-term personnel, limit use of fellows and students to 1/2, without changing the target number of FTEs
      Ensure the coverage of key areas by converting two area coordinators (Production Environment, Framework & Infrastructure), now on temporary positions, into CERN permanent staff
      Alleviate the volatility of the Core Offline team with at least two long-term (6 years, LD-like) positions at CERN to replace short-term ones (detaching LCG personnel to ALICE would be a natural solution)
      Which profile/task????

  • ALICE priorities - 3/4

    Core Offline group at CERN:

  • ALICE priorities - 4/4

    Detector Offline at collaboration institutes:

      About 10 FTEs missing in the subdetector projects for software developments
      This is a responsibility of the institutes in charge of the subdetector projects
      We are working hard to find these people
      Additional resources from funding agencies will have to be discussed case by case

  • Answer to questions - 1/4

    Profile of available and required manpower at CERN / Regional Centers / Institutes

    Core offline group: 2 CERN staff + 2 long-term personnel would create satisfactory working conditions

    We have reached an equilibrium which enables us to fulfill all the assigned tasks; however, the equilibrium is fragile.

  • Answer to questions - 2/4

    Profile of available and required manpower at CERN / Regional Centers / Institutes

    Detector groups: most of the groups are understaffed; personnel (about 10 FTE) dedicated to detector projects is systemically needed in the institutes

    Solution to be found within the collaboration, with the help of funding agencies on a case-by-case basis

  • Answer to questions - 3/4

    Other resources existing and potential

      A few occasional collaborations with industries

    Computing elements which will not be provided in case the required manpower and resources are not available

      Lack of readiness of algorithms or accuracy in algorithms
      Serious difficulties to interface ALICE software to LCG middleware
      Quasi impossibility to adopt new LCG common software

  • Answer to questions - 4/4

    Measures of progress in producing necessary software:

      Because of the scarce personnel available for the offline project, a lightweight and dynamic organization has been adopted
      This organization has so far been successful in producing a framework, detector software and a grid environment routinely used by the collaboration for detector design and physics validation.
      LCG software will be considered as soon as stable versions outperforming the software presently in use become available.
      Milestones to test LCG middleware and fabric will be closely watched.

  • Conclusions - 1/2

    The core team of the offline project has adapted to the reduced personnel available and established its tasks and objectives accordingly.

    The edifice is fragile : any additional cut in (temporary) personnel might hinder the availability in due time of the software needed for data taking and analysis.

    Securing two staff positions is instrumental for the project success.


  • Conclusions - 2/2

    Adding 2-3 long-term personnel to the core team would alleviate the unstable situation by making it less dependent on temporary personnel

    The lack of personnel fully dedicated to software development in the detector projects is worrisome, as the lack of indispensable algorithms might dramatically delay first physics results from LHC.

    The needed personnel (not necessarily computer specialists) must be recruited by the institutes of the collaboration.


  • DAQ-HLT Data Flow

    [Data-flow diagram. Recoverable labels: HLT; Detector RO; DDL SIU; DDL DIU; D-RORC; H-RORC; Detector LDC; HLT LDC; FEP; HLT algorithm; Event Building Network (raw, HLT data, decisions); Storage Network; GDC; ~400 DDL; ~300 DDL; TPC, TRD, MUON, ITS; 10 DDL; Pre(co)-processing; MODE A; MODE B&C; Trigger]

  • Tasks of offline

    Framework and infrastructure coordination

      Framework development (simulation, reconstruction, analysis)
      Persistency technology
      Computing/Physics data challenges with DAQ/HLT
      Industrial joint projects
      Technology tracking
      Librarian, CVS maintenance, test procedures
      QA tools
      Support and documentation

    Core

  • Tasks of offline

    Simulation coordination

      Detector simulation
      Physics simulation
      Physics validation
      G4 integration
      Fluka integration
      Radiation studies
      Geometrical modeler

    Core

  • Tasks of offline

    Reconstruction and physics coordination

      Tracking
      Detector reconstruction
      Global reconstruction
      Analysis tools
      Analysis algorithms
      Physics data challenges
      Calibration and alignment algorithms

    Core

  • Tasks of offline

    Production environment coordination

      Production environment for simulation, reconstruction and analysis
      Distributed computing environment
      Data bases organization

    Core

  • Tasks of offline

    World computing coordination

      Planning and resources coordination for LCG1&2
      Relations with national/international Grid projects

    Core

  • Computing needs for PDC III

    Flexibility of distributed computing model

    Alternative scenarios

  • LCG resources pledged for ALICE in 2003

    USA quota to be confirmed

  • Objectives of PDC 3

    The estimation of the number of events is essentially defined by the jet study

      10^5 events for jets with pt ~10-20 GeV/c
      10^4 -- 10^5 events for studies on particle correlations and simple and double strangeness (,)
      10^6 events for high pt jets (~10^5 underlying events), charmonium and bottomonium into e+e-; similar statistics, the same underlying events can be reused

    Centrality: 50% central events (b
  • Relation with HLT

    In ALICE, Offline, HLT and DAQ are three distinct projects

    Cooperation between HLT and Offline is good

    HLT is using the common AliRoot framework to do simulation

    Some of the HLT algorithms have been integrated in the Offline framework for testing

    Like Offline, HLT coordinates the activities of the different subdetector projects

    The main thrust of HLT is the definition of the HLT architecture (HW and SW) and some seminal work on algorithms

    More work on algorithms is done in collaboration with the subdetector projects

    Integration and testing of the three projects is also performed during DCs

    NOT included is manpower for the online infrastructure, i.e. cluster management, process communication infrastructure, monitoring, FPGA coprocessor interface.

  • ALICE Data challenges

    [Diagram. Recoverable labels: ROOT; DAQ; ROOT I/O; CASTOR; Simulated Data; Raw Data; CERN TIER 0, TIER 1; Regional TIER 1, TIER 2; GRID]

  • ADC IV Hardware Setup

    Total: 192 CPU servers (96 on Gbe, 96 on Fe), 36 DISK servers, 10 TAPE servers

    [Network diagram. Recoverable labels: 10 TAPE servers (distributed); Backbone (4 Gbps); CPU servers on FE; node ranges TBED00 01-12 through 77-88 and 89-112; TBED0007D 01D-12D through 25D-36D; LXSHARE; 4 Gigabit switches; 3 Gigabit switches; 4 Gigabit switches; 2 Fastethernet switches; Fibers; TOTAL: 32 ports; TOTAL: 18 ports]

  • ALICE DC BW


  • ADC IV performances

    Event building with flat data traffic

      No recording
      5 days non-stop
      1800 MBytes/s sustained

    Event building and data recording

      With ALICE-like data traffic
      Recording to CASTOR
      4.5 days non-stop
      to disk: ~140 TBytes
      350 MBytes/s sustained
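
    The recording figures can be cross-checked with simple arithmetic: a sustained rate multiplied by the run duration gives the total volume. The sketch below assumes decimal units (1 MByte = 10^6 bytes, 1 TByte = 10^12 bytes) and a perfectly constant rate; the function name is hypothetical.

```python
# Data volume implied by a sustained rate held for a given duration.
# Assumptions (not from the slides): decimal units and a constant rate.
SECONDS_PER_DAY = 86_400


def volume_tbytes(rate_mbytes_s: float, days: float) -> float:
    """Total volume in TBytes moved at rate_mbytes_s over `days`."""
    return rate_mbytes_s * 1e6 * days * SECONDS_PER_DAY / 1e12


recorded = volume_tbytes(350, 4.5)    # event building and recording to CASTOR
event_built = volume_tbytes(1800, 5)  # event building only, nothing recorded

print(f"350 MBytes/s for 4.5 days -> {recorded:.0f} TBytes")
print(f"1800 MBytes/s for 5 days  -> {event_built:.0f} TBytes")
```

    Under these assumptions 350 MBytes/s over 4.5 days gives about 136 TBytes, consistent with the ~140 TBytes quoted as written to disk.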
