Ian Bird, WLCG Overview Board; CERN, 28th November 2014


Outline
- MoU updates
- Status of the KISTI and Russian Tier 1s
- LHC schedule
- Summary of the October RRB: resource pledges status and outlook
- Progress report: activities and plans
- H2020 proposals
- Data preservation and open data
- Some concerns: data protection
- Also reported on today: Oracle licences at Tier 1s; HEP Software Foundation; data preservation / open data; update on possible EU-T0 activities (tbc)

WLCG MoU Topics
- Russian Tier 1 sites: letters received confirming resource pledges for 2015. The infrastructure is in place and they will be treated as Tier 1 sites for 2015; however, this is pending formal approval of the WLCG MoU at government level, as stated in the letters.
- Mexico: UNAM signed as an ALICE Tier 2 in November.
- Pakistan: COMSATS Institute of Information Technology (ALICE): MoU ready for signature.
- South Africa: anticipating the CHPC centre, Cape Town, as a Tier 2 for ALICE and ATLAS.
- The Tier 2 at KISTI (ALICE) is now fully replaced by the Tier 1 facility.

KISTI Tier 1 (ALICE)
- KISTI provided 6.11% of the total wall-clock hours for ALICE jobs in 2014 (including Tier 2s).
- Apr-Aug 2014: 2,688 concurrent jobs = 28 kHS06 (84 nodes x 32 logical cores per node x 10.5 HS06/core); monitoring plot: ~2,500 running jobs, ~100 queued agents.
- Stable and smooth running; a few short incidents, mostly due to maintenance of the undersea optical fibre link.
- Completed 2M jobs in the last 6 months, more than the total processed in the whole of last year (1.8M).

KISTI network upgrade (from Seo-Young, CERN-Korea Committee meeting, Oct. 2014)
- Funding has been secured with strong support from the Korean government; only administrative procedures remain.
- The 10 Gbps upgrade will be completed by March 2015, before Run 2 starts; the schedule foresees tender notification, a 2-month contract phase and a 3-month upgrade phase, leaving a roughly 2-month buffer before Run 2.
- Current link: 2G KREONET2 + 2G SURFnet dedicated circuit; target: 10G + 10G SURFnet, with the contracted provider allocating the dedicated 10G circuit.

New Tier 1s provided significant contributions in June-October 2014.

LHC schedule (slide: figure only).

Resource usage (accounting plots).

Pledges vs requests
- Pledges vs requests for 2015 (plot).
- Pledges vs requests for 2016 (plot); pledges incomplete.
- Requirements vs pledges (plot). NB: for 2016, pledges are not yet complete and requirements are not final until the April RRB.

October RRB / C-RSG report
- RSG expectations for Run 2, taken from the C-RSG report.
- Remove efficiency factors from all accounting reports.
- Data popularity is now being monitored by most experiments; still work needed to understand the figures.
- The C-RSG notes the use of HLT farms and does not consider them opportunistic.
- Noted progress in software improvements.

Tier 0
- Current tender: 2015 capacity of 50 PB disk and 750 kHS06 (36k cores), 2/3 in Wigner and 1/3 in Meyrin.
- Significant growth during the last month (plots: VMs created per hour; total number of VMs).

Understanding CERN efficiencies
- A significant fraction of the CERN pledge is delivered as non-batch services (VOBoxes, experiment services, etc.; hundreds of machines).
- These have been included in the standard efficiency calculation, which is misleading; the accompanying plot shows the batch-only efficiency separately (see the numerical sketch below).
- In future CERN will produce two reports: batch-only and total resources.
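A small numerical sketch of why folding always-on service nodes into the efficiency calculation is misleading; the numbers and record layout are invented for illustration and are not CERN's actual accounting code:

```python
# Minimal sketch: including non-batch service nodes in the efficiency
# denominator dilutes the reported figure. Numbers are illustrative only.

def cpu_efficiency(records):
    """CPU efficiency = total CPU time / total wall-clock time claimed."""
    cpu = sum(r["cpu_hours"] for r in records)
    wall = sum(r["wall_hours"] for r in records)
    return cpu / wall if wall else 0.0

# Hypothetical month of accounting records.
batch_nodes = [{"cpu_hours": 660_000, "wall_hours": 730_000}]    # busy batch farm
service_nodes = [{"cpu_hours": 40_000, "wall_hours": 300_000}]   # VOBoxes etc., CPU mostly idle

print(f"batch-only efficiency: {cpu_efficiency(batch_nodes):.0%}")                  # ~90%
print(f"combined efficiency:   {cpu_efficiency(batch_nodes + service_nodes):.0%}")  # ~68%
```

Splitting the accounting into batch-only and total resources, as planned, avoids this dilution.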
ALICE 1
- For Run 2, expect a doubled effective event output rate, up to 8 GB/s:
  - higher track multiplicity (beam energy and pileup);
  - ~25% larger event size (additional detectors added);
  - increased capacity of HLT and DAQ.
- Significant efforts to improve the reconstruction software, including:
  - x2 speed-up of the field calculation;
  - use of the TRD in the track fit to improve resolution;
  - HLT tracks used as seeds for offline reconstruction;
  - improved alignment and calibration procedures;
  - reduced memory requirements for calibration and reconstruction.

ALICE 2
- Simulation: switch to Geant4.
  - Physics validation ongoing, comparing detector response to data; a problem with the TRD response remains.
  - CPU performance is x2 worse than Geant3, even after significant improvements.
  - However, the multi-threading capability gives access to new (HPC) resources.
- Collaborating with other experiments to explore the use of spare CPU cycles on supercomputers, e.g. with ATLAS via a BigPanDA/AliEn interface on Titan at ORNL.
- Use of the upgraded HLT farm: up to 3% additional resources, using OpenStack (as do ATLAS, CMS and the Tier 0).

ALICE 3
- Analysis trains: most analysis work has been moved successfully to trains, a big efficiency improvement.
  - Big improvements in automation and better management and use of trains.
  - Factor 5 decrease in turnaround time since January 2012; able to run many more trains.
- Production: steady RAW and MC processing activities.
  - Re-processing of RAW data started as planned, to be complete by April, with full re-calibration and all software updates; all Run 1 RAW data processed with the same software.
  - Associated MC (general and PWG-specific) to be run soon.
- ALICE re-commissioning has now started.
  - Tests of the upgraded detector readout, trigger, DAQ and the new HLT farm; full data-recording chain, with conditions-data gathering.
  - November-December: cosmic-trigger data taking with offline processing.

AliceO2 framework (Run 3)
- Re-uses the existing building and testing infrastructure of FairRoot.
- Uses the transport layer and the Dynamic Deployment Service from ALFA.
- ITS simulation is implemented in AliceO2; more detectors will follow.
- Related libraries and tools: ALFA, PandaRoot, CbmRoot, FairRoot.
- https://github.com/AliceO2Group/AliceO2

ATLAS 1
- Significant simulation activity in the last 6 months:
  - 1.6 B events with full Geant4 simulation;
  - 1 B events with fast simulation;
  - 260 M events as input for DC14, as a test for Run 2.
- Consistently using ~40% more than the pledges: Tier 1 job slots full, Tier 2s typically above pledge, plus the HLT farm when available (20k simultaneous jobs).

ATLAS 2
- Reconstruction time improved by a factor 3 through algorithmic improvements, code clean-up, use of optimised matrix and maths libraries, and the move to 64-bit and the SL6 OS.
  - Allows operation at the 1 kHz HLT output rate desired for Run 2, with no compromise in reconstruction quality.
- New data-management and production-management systems are being commissioned for Run 2 (DC14).
- New data-management strategy to optimise between disk and tape: introduction of data lifetimes, automating the clean-up of little-used data on tape and disk (illustrated in the sketch below).
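A minimal sketch of the lifetime idea, assuming a dataset record with a last-access time and an assigned lifetime; this is illustrative pseudologic, not the actual ATLAS or Rucio policy engine, and the dataset names and thresholds are invented:

```python
# Illustrative sketch of lifetime-based clean-up of little-used data.
# Not the ATLAS/Rucio implementation; fields and thresholds are invented.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Dataset:
    name: str
    last_accessed: datetime   # from popularity/access monitoring
    lifetime_days: int        # lifetime assigned at creation, renewable on access

def is_expired(ds: Dataset, now: datetime) -> bool:
    """A dataset becomes a deletion candidate once its lifetime has elapsed
    since the last recorded access."""
    return now - ds.last_accessed > timedelta(days=ds.lifetime_days)

def cleanup_candidates(datasets, now=None):
    now = now or datetime.utcnow()
    return [ds.name for ds in datasets if is_expired(ds, now)]

# Example: one recently used dataset, one untouched for a year.
catalog = [
    Dataset("data12_8TeV.AOD.x", datetime(2014, 11, 1), lifetime_days=180),
    Dataset("mc11_7TeV.HITS.y", datetime(2013, 11, 1), lifetime_days=180),
]
print(cleanup_candidates(catalog, datetime(2014, 11, 28)))  # -> ['mc11_7TeV.HITS.y']
```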
ATLAS 3
- New analysis model and framework:
  - supports standardised use of performance recommendations (jet energy scale, etc.);
  - ensures work common between physics groups is done only once: a new train model replaces the incoherent group production (under validation);
  - eliminates large intermediate datasets and different formats via the new dual-use xAOD;
  - improved space occupancy and availability of data to users.

ATLAS 4
- Simulation:
  - Integrated simulation framework in use.
  - Full Geant4 simulation remains the (slow) gold standard; work to speed it up is ongoing.
  - The Geant4 v10 (multithreaded) implementation will not be ready for the Run 2 start; the timetable is unclear.
  - Fast simulation: a new version of FastCaloSim is under development, expected in the second half of 2015.
  - Fast digitization and truth-seeded reconstruction: development ongoing.
- Distributed computing services:
  - ProdSys2 will replace ProdSys1 on 1 December.
  - The new distributed data management system (Rucio) will replace DQ2 on 1 December.

CMS: production
- Simulation and reconstruction have been steady during 2014; CSA14 and samples for Run 1 analysis were the bulk of the events, and the beginning of the Run 2 sample preparation can be seen in the simulation plot (plots: GEN-SIM, AODSIM, HLT; HLT up to 5k jobs).
- CMS is validating the new MiniAOD format: fast to produce, saving analysis computing, and small to store; intended to cover the bulk of analysis use cases.

CMS: data access and job management
- Data federation in Run 2: scale tests were performed in Europe and the US; 20% of jobs were able to access data over the wide area, at 60k files/day and O(100 TB)/day (a minimal access example follows below).
- CRAB3 was validated in CSA14 for specific workflows as planned, maintained at a scale of about 10%, with adoption targeted at 50% by the end of 2014.
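As an illustration of this kind of federated access, a job can open a file through an xrootd redirector instead of from local storage. The sketch below uses PyROOT; the redirector hostname and file path are placeholders, not real CMS endpoints or datasets:

```python
# Minimal sketch of wide-area data access through an xrootd federation.
# The redirector hostname and file path are placeholders.
import ROOT

# The federation redirector resolves the logical file name to whichever
# site currently hosts a replica; the job does not need to know the site.
url = "root://xrootd-redirector.example.org//store/data/Run2010B/Mu/AOD/file.root"

f = ROOT.TFile.Open(url)           # remote open over the WAN
if f and not f.IsZombie():
    tree = f.Get("Events")         # read the events tree exactly as if local
    print("events available:", tree.GetEntries() if tree else 0)
    f.Close()
else:
    print("could not open remote file")
```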
CMS: data management in Run 2
- Dynamic Data Placement (DDP): scripts replicate heavily accessed data; DDP was engaged and replicated the samples during the stress test. Monitoring of disk usage is being improved and automated.
- Disk/tape separation at the Tier 1s is complete.

CMS: software improvements required for Run 2
- Multi-core transition.
- Speed and performance improvements for the more complex events expected in Run 2; e.g. simulation is x2 faster than last year, with an improved material model for the tracker.

Multithreaded CMSSW applications
- Motivation for multithreaded jobs in Run 2:
  - ensure processing of luminosity sections within a single job despite higher trigger rates and increased event complexity;
  - reduce the number of grid jobs to manage;
  - reduce the required memory per core;
  - prepare the CMSSW framework and algorithms for future technologies.
- Reconstruction: current performance meets the goal for Run 2; work continues to improve performance scaling by making more algorithms thread-safe.
- Big memory savings: 0.35 GB per additional thread instead of 1.8 GB per job (a configuration sketch follows below).
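For reference, enabling this in a job is a configuration-level change. The fragment below is a minimal sketch using the standard CMSSW multi-threading options (numberOfThreads, numberOfStreams); the process content, input file and event count are placeholders:

```python
# Minimal sketch of a multithreaded CMSSW job configuration.
# Option names follow the standard CMSSW multi-threading settings;
# the source file name and process details are placeholders.
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

# Run 4 threads sharing one process: memory grows by ~0.35 GB per extra
# thread rather than ~1.8 GB per extra single-threaded job.
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(4),
    numberOfStreams = cms.untracked.uint32(4),   # concurrent events in flight
)

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring("file:raw_placeholder.root"))

process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(1000))
```

numberOfStreams sets how many events are processed concurrently; making it equal to numberOfThreads is a common starting point.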
LHCb 1
- Simulation is the main production activity, mostly driven by analysis needs; the HLT farm is the largest single MC site.
- Processing:
  - re-process the 2010 data with calibrations and software consistent with 2011 and 2012;
  - full re-stripping to create a legacy dataset;
  - also commissioning the new microDST format to be used in 2015;
  - MC campaign in preparation for 2015.

LHCb 2
- Software validation for Run 2 is ongoing; the move to ROOT 6 still has some memory-usage concerns.
- Reminder of the Run 2 model: split HLT, online calibration, and a single offline reconstruction pass (no reprocessing before LS2). Goal: perform all necessary calibrations using HLT1 data and apply them to both HLT2 and offline.
- Status: the split-HLT and online calibration frameworks are implemented; online calibration and monitoring procedures are defined; reconstruction quality has been demonstrated; the conditions-database modifications for automatic transmission of new constants to HLT2 and offline are implemented.
- Testing: monthly end-to-end integration tests started in October.

LHCb 3
- Full stream: prompt reconstruction as soon as the RAW data appears offline.
- Parked stream: safety valve, probably not needed before 2017.
- Turbo stream: no offline reconstruction; analysis objects are produced in the HLT.
  - An important test for Run 3: events are reconstructed and selected online, and an MDST-like output is written from the HLT instead of the full RAW data, allowing a larger HLT rate thanks to the x10 smaller event size.
  - To begin with (proof of principle) the RAW data is also written, but the offline reconstruction and stripping steps are skipped.

H2020 project submissions (EINFRA calls)
- AARC (Terena, 24 months, 3 M EUR): Authentication and Authorization for Research and Collaboration; a framework for a federated identity platform (eduGAIN).
- DPINFRA (Inmark, ES, 48 months, 8 M EUR): data preservation services infrastructure for big-data science.
- EGI-Engage (EGI.eu, 30 months, 8 M EUR): evolution of EGI.
- INDIGO-DataClouds (INFN, 30 months, 11 M EUR): building a data/computing platform and tools for science, provisioned over hybrid (public + private) e-infrastructures and clouds.
- RAPIDS (EBI, 36 months, 8.7 M EUR): shareable science-domain workflows and services (SaaS) over e-infrastructures to hide complexity; involvement of several EIRO labs.
- ZEPHYR (PIC/IFAE, 24 months, 4 M EUR): prototyping and modelling of Zettabyte-Exascale storage systems for future science data.

Open data: opendata.cern.ch
- The CERN Open Data portal gives access to and allows analysis of public data from all LHC experiments; policies and strategies were discussed in the LHCC in 2013.
- All experiments have started to contribute training data (masterclasses) and analysis tools; CMS has released 27 TB of 2010 collision data.
- The portal provides storage, access and distribution of the open data, and is the entry point for tools and instructions. "Is this the next (R)evolution?" (slide from Cristinel Diaconu at the LHCC; see also Jamie Shiers' talk today).

Volunteer computing
- LHCb: pilot project, also using BOINC and CernVM, controlled by DIRAC; an authentication issue remains, which could make use of the Data Bridge.
- Outreach potential: share and coordinate outreach efforts.
- Technology: CernVM and CernVM-FS are the common denominator. Areas for more work within HEP: CernVM images / CernVM-FS, pre-cached images, etc. (Cloud working group); sharing experience on common components, e.g. the Data Bridge.
- Volunteer computing versus other opportunistic resources: volunteers among the general public, desktops in labs; what is the cost of opportunistic capacity versus extra servers in the data centre?
- For some small Tier 2s, virtualisation with BOINC may be an interesting alternative; if local storage is used at the Tier 2, other cloud technology such as VAC may be more appropriate.
- Volunteer computing is seen as an extension of the experiments' cloud model.

Data privacy
- Changes in EU law on the protection of personal information mean that user consent is no longer sufficient; we need to review and update our AUP and data protection policies.
- Currently some of our (WLCG and experiment) information publication is potentially illegal in many European countries.
- A recent issue affected xrootd (and the sending of monitoring information to sites in the US); technical steps were taken to remove personal information and to relocate the end-point to CERN (or the EU); an illustrative approach is sketched below.
- There is a potential risk of breaking services if sites are required to stop services, data publication, etc.
- We need to be proactive and address this: a new data protection policy is being drafted; service developers (middleware and experiment) must be aware of this issue and address it correctly; we will need to follow up rapidly when problems are seen. But sites should also not be too precipitous in their reaction without some discussion of the issue or concern.
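One illustrative way to remove personal information, not necessarily the fix that was applied to the xrootd monitoring flow, is to replace identifying fields with a keyed hash before records leave the site, so usage can still be aggregated per (pseudonymous) user; the field names and key handling below are assumptions:

```python
# Illustrative sketch: pseudonymise personal fields in a monitoring record
# before it is shipped off-site. Not the actual xrootd monitoring fix;
# record layout and key handling are invented for the example.
import hashlib
import hmac

SITE_SECRET = b"site-local-secret-key"   # kept at the site, never published

def pseudonymise(value: str) -> str:
    """Keyed hash: stable per user (so usage can still be aggregated),
    but not reversible without the site-local key."""
    return hmac.new(SITE_SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

def scrub(record: dict) -> dict:
    personal_fields = ("user_dn", "client_host")
    return {k: (pseudonymise(v) if k in personal_fields else v)
            for k, v in record.items()}

raw = {"user_dn": "/DC=ch/DC=cern/OU=Users/CN=jdoe",
       "client_host": "pc-12.example.org",
       "file": "/store/data/run123/file.root",
       "bytes_read": 1_234_567}
print(scrub(raw))   # DN and host replaced by opaque tokens; the rest unchanged
```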
Middleware support
- Concern over the lack of support for key pieces of software such as ARGUS (security infrastructure).
- This reinforces the view that we must move away from HEP-specific software as far as feasible; in this instance we are discussing with OSG/EGI whether a convergence of efforts is possible.
- In the long term this remains a significant concern.

Staffing concerns
- All experiments observe a decrease in effort for computing and software, no longer supplemented by EC projects in the way it has been.
- Computing and software work is not helpful for a physics career, and it is difficult to find and retain people with the appropriate skills; the lack of a career path outside the labs is a major concern.
- Effort on computing and software needs to be treated by the collaborations at the same level as detector building and other key tasks.

Strategy for the next few years
- Reduce middleware dependencies and complexity.
- Reduce the operational load (helped by the first point); a review of where effort is spent is underway.
- Ensure we are simply able to use any offered resources (grid, cloud, HPC, opportunistic, public, private, ...).

Summary
- Good progress in preparations and readiness for Run 2.
- Significant efforts by the experiments to live within realistic computing budgets, with significant performance improvements achieved.
- The resource outlook for 2015-2016 is good, but uncertain on longer timescales.
- Some concerns over long-term support for middleware: the strategy must be to reduce dependencies, simplify the needs, and be able to easily use opportunistic and other resources.