




HEPiX Fall Meeting 2015
Brookhaven National Laboratory, Upton, NY, US

https://indico.cern.ch/event/384358/
Arne Wiebalck, Eric Bonfillou, Alberto Rodriguez Peon

Outline
- 2015 Spring Meeting & General HEPiX News
- Site Reports (21)
- Grids, Clouds, and Virtualization (14)
- End User Services & Operating Systems (5)
- Storage and File Systems (10)
- Basic IT Services (10)
- Computing and Batch (8)
- IT Facilities (6)
- Security and Networking (8)
- Closing remarks

HEPiX
- Global organization of service managers and support staff providing computing facilities for the HEP community
- Participating sites include BNL, CERN, DESY, FNAL, IN2P3, INFN, KEK, LAL, LBNL, NERSC, NIKHEF, RAL, TRIUMF

Working groups to tackle specific/current issues and topics
- (Storage), (Virtualization), Configuration Management, IPv6, Benchmarking

Meetings are held twice per year
- Spring: Europe, Autumn: U.S./Asia

Reports on status and recent work, work in progress & future plans
- Usually no showing-off, honest exchange of experiences

HEPiX Fall 2015
- Oct 12-16, 2015 at Brookhaven National Laboratory, Upton, NY, US
- Combined with GDB

- 110 registered participants (high!)
- Many first-timers again
- 45% from Europe, 35% from North America, 10% Asia/Pacific, 10% from companies
- 32 different affiliations

- 82 contributions, 1630 minutes



HEPiX Working Groups
Benchmarking:
- Experiments started to create their own benchmark suites in addition to HS06 (LHCb fast benchmark, ROOT marks, ATLAS validation kit, ...)
- Four areas for further work (from the MB):
  - CPU power as seen by the job slot
  - Whole-server benchmarking
  - Accounting
  - Storage/transport of benchmarking information (machine/job features)
- Group being formed now; HEPiX experts will collaborate on the first two items

Site Reports (1)
HTCondor dominating batch system
- 80% of the site reports mentioned HTCondor, with a mixture of CEs (HTCondor, ARC, CREAM)
- LSF and UGE much less visible than at previous meetings
Requests for large-memory jobs (4-6 GB)
- Small fraction of total jobs so far, but tendency towards an increased RAM-to-core ratio
Integration of diverse compute resources
- Cover peak loads, use cheap opportunistic resources, incorporate HPC centers, federated clouds
- FNAL: HEPCloud, BNL/AUS: AWS, CNAF: bank (!)


Site Reports (2)
Sync services becoming more popular
- bwSync (KIT), DESYbox, IHEPbox, CERNbox

ZFS in production at various sites
- Features wanted so much that some sites even consider FreeBSD!

Puppet dominating config mgmt system
- Quattor flag still held up by some (very few) sites
- Ansible gaining popularity (in prod at CSC, NERSC)

SL vs. CentOS: not a hot topic



Site Reports (3)
Continuing trend: enable HEP labs to support other sciences
- SLAC reported this time
- Provide small to mid-range HPC solutions, incl. storage
- Provide assistance in developing computing models
- Pay-as-you-use (storage), LSF analytics for accounting/dashboards
- Considering a lease model for the hardware

New cooling system evaluated at PIC: https://www.youtube.com/watch?v=EZmm7P1mPZs


Site Reports (4)


Virtualization (1)

OpenStack seems to become the de-facto standard for managing private clouds
- OpenNebula mentioned as well

Interest in commercial cloud offerings, several open questions
- How to spot and deal with performance variability & inhomogeneity?
- How to assess presumed and perceived performance?
- How to procure commercial cloud resources?

Presentation by D. Giordano et al.
- Ran several procurements, developed a benchmark suite


Virtualization (2)
Performance
- Optimization dependencies: NUMA, pinning, huge pages, EPT

Pre-deployment testing
- Small issues can have a major impact

Performance monitoring
- Need continuous benchmarks to detect performance changes

Containers being evaluated for various use cases
- s/w development for various Linux flavors (BNL)
- compute: short-lived single applications, very low performance overhead (BNL)
- HTCondor support to come
- service migration from VMs to containers & Mesos (RAL)
- even on a Cray! (NERSC)

Before/after optimization:

VM sizes (cores)   Before   After
4x 8               7.8%     3.3%    (batch WN)
2x 16              16%      4.6%    (batch WN)
1x 24              20%      5.0%    (batch WN)
1x 32              20.4%    3-6%    (SLC6 WN)
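These before/after figures follow the optimization work listed above (NUMA, pinning, huge pages, EPT). As a small illustration of one of those knobs, a minimal sketch of vCPU pinning via the libvirt Python bindings; the domain name, guest size and host core count are hypothetical, and a real deployment would derive the map from the host's actual NUMA topology:

```python
import libvirt

# Hypothetical hypervisor connection and guest name.
conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("batch-wn-042")

HOST_PCPUS = 16          # assumed number of physical CPUs on the host
for vcpu in range(8):    # an 8-core guest, as in the "4x 8" row above
    # Pin guest vCPU i to host pCPU i; the map is one boolean per host pCPU.
    cpumap = tuple(pcpu == vcpu for pcpu in range(HOST_PCPUS))
    dom.pinVcpu(vcpu, cpumap)

conn.close()
```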

End-User Services & OS
SL update
- SL 5.11 is the last version (EOL: March 2017)
- SL 6.7 released in Aug 2015 (driver updates)
- SL 7.1 released in Apr 2015 (OverlayFS in technical preview)
- RHEL 7.2 in beta
- WIP: SL Docker images

Remaining talks from CERN
- Linux@CERN and CC7 update
- Self-service kiosk for Macs
- Collaboration services stack


Storage & File Systems
10 talks, 2 from CERN
- Alberto Pace: Future home directory at CERN
- Alberto Peon: CvmFS deployment status and trends

CEPH usage increasing across sites
- RACF: two different CEPH clusters that sum up to 1 PB of data; main user of the deployment is the ATLAS Event Service
- RAL: 5,220 TB storage cluster; Erasure Coding for a more efficient use of disk space, but at the cost of CPU time and extra concurrency
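To make the disk-space argument concrete: a k+m erasure-coded pool writes (k+m)/k bytes of raw capacity per usable byte, versus 3x for triple replication. A small sketch of that arithmetic; the 8+3 profile is illustrative only, not RAL's actual configuration:

```python
def raw_per_usable(k, m):
    """Raw bytes written per usable byte in a k+m erasure-coded pool."""
    return (k + m) / float(k)

# Illustrative comparison for 1 PB of usable data (profiles are assumptions).
for label, ratio in [("3x replication", 3.0), ("EC 8+3", raw_per_usable(8, 3))]:
    print("%-15s needs %.2f PB raw per 1 PB usable" % (label, ratio))
```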


Storage & File Systems (2)
Large dCache deployments at a few US sites
- Fermilab: different technologies per use case (BlueArc/Lustre/EOS), but looking to consolidate into dCache on top of tape
- BNL: also using tape as backend; part of FAX (federated ATLAS storage using XRootD); using 10.7 PB of data out of 14.2 PB available

Two presentations from sponsor companies
- DDN Storage: reducing file system latency through parallelism; IME parallel file system
- Western Digital: machine learning techniques to improve disk performance


Basic IT Services
10 talks, 2 from CERN
- Miguel: CERN Monitoring Status Update
- Alberto: Update on configuration management at CERN

ELK (Elasticsearch, Logstash and Kibana) being consolidated as the monitoring infrastructure at most sites
- Also with some variations (Logstash/Flume, Kibana/Grafana)
- RAL considering InfluxDB + Grafana as an alternative to Ganglia
- NERSC using RabbitMQ for data transport and collectd for statistics collection
- Real-time stream processing getting traction at CERN and LBNL
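For flavour, a minimal sketch of feeding one monitoring event into Elasticsearch over its REST API; the endpoint, index and field names are invented, and in the ELK/Flume pipelines mentioned above the event would normally pass through Logstash or a broker such as RabbitMQ rather than being indexed directly:

```python
import datetime
import json
import requests

# Hypothetical monitoring event; field names are illustrative only.
event = {
    "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "host": "wn123.example.org",
    "service": "batch",
    "message": "job 4711 finished",
}

# Hypothetical Elasticsearch endpoint and daily index.
url = "http://es.example.org:9200/logs-2015.10.15/event"
resp = requests.post(url, data=json.dumps(event),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
print(resp.json()["_id"])
```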


Basic IT Services (2)
Puppet deployment at KIT
- Puppet environments based on git branches, generated by GitLab CI
- Foreman for provisioning and host discovery
- Using standard eyaml for storing secrets

A few sites still on the Quattor side
- Active community (RAL, LAL, Brussels)
- Adopting Aquilon as replacement of (S)CDB as data store

Foreman and GitLab presentations by BNL
- General description of what the software does and how they used it

Computing & Batch
8 talks, 2 from CERN
- Helge: Update on benchmarking
- Jerome: Future of batch processing at CERN

HTCondor is the most popular batch system
- Most of the sites use (or plan to use) HTCondor
- The HTCondor team presented some of the new features in the 8.4 version, including:
  - Support for dual-stack IPv4/IPv6
  - Improved scalability and stability
  - Docker support
- BNL presented their strategy to accommodate multi-core jobs using partitionable slots
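On the partitionable-slot point: with partitionable slots a multi-core job simply states its requirements and a matching dynamic slot is carved out for it. A minimal submission sketch using the HTCondor Python bindings; the executable path and resource numbers are hypothetical, and this is an illustration rather than BNL's actual setup:

```python
import htcondor

# An 8-core job; on a partitionable slot a dynamic slot matching these
# request_* values is split off at match time.
sub = htcondor.Submit({
    "executable": "/usr/bin/my_multicore_payload",  # hypothetical payload
    "request_cpus": "8",
    "request_memory": "16 GB",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    cluster_id = sub.queue(txn)
print("submitted cluster", cluster_id)
```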


Computing & Batch (2)
LSF and Univa Grid Engine also present at a few sites
- IN2P3 very happy with the support provided by Univa, although considering what to do after the contract expires

Discussion on CPU benchmarking
- HS06 is well established, but considered insufficient in some areas
- Doesn't measure performance on multi-core processors well
- Proposal to have a fast benchmark that can measure the performance of a job slot in minutes
- The HEPiX benchmarking WG is being formed again to address these issues
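To illustrate the "fast benchmark" idea (not the working group's actual benchmark, which was still being defined at the time), a toy sketch that times a fixed CPU-bound kernel inside a job slot and reports a throughput score:

```python
import hashlib
import time

def fast_slot_score(iterations=2000000):
    """Run a fixed CPU-bound kernel and return iterations per second.
    The absolute value is arbitrary; only comparisons between job slots
    running the same kernel are meaningful."""
    payload = b"x" * 64
    start = time.time()
    for _ in range(iterations):
        payload = hashlib.sha1(payload).digest()
    return iterations / (time.time() - start)

if __name__ == "__main__":
    print("slot score: %.0f hashes/s" % fast_slot_score())
```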


Computing & Batch (3)
Evaluation of NVMe drives by BNL
- NVMe eliminates latency and bandwidth limitations imposed by SAS/SATA controllers
- Benchmarks show ~100% more performance compared with SSDs
- Still an expensive technology, but prices are expected to come down over time
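For context on how such comparisons are made, a rough random-read sketch in Python; the device path is hypothetical, and a serious NVMe-vs-SSD measurement would use a dedicated tool such as fio with O_DIRECT and multiple queue depths rather than this single-threaded loop:

```python
import os
import random
import time

def random_read_mb_per_s(path, block=4096, reads=10000):
    """Issue block-aligned random reads and return the achieved MB/s.
    Without O_DIRECT the page cache inflates the result, so treat this
    as a rough illustration only."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        start = time.time()
        for _ in range(reads):
            offset = random.randrange(0, size - block) & ~(block - 1)
            os.pread(fd, block, offset)
        return reads * block / (time.time() - start) / 1e6
    finally:
        os.close(fd)

# print(random_read_mb_per_s("/dev/nvme0n1"))  # hypothetical device path
```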



IT Facilities & Business Continuity (1)

NERSC Computational Research and Theory Facility (CRT) - Elizabeth Bautista
- Four-story building, 140k square feet, cost: 143 MUSD
- Entirely based on free cooling (about 23C all year)
- Server room heat used to warm up the offices
- Redundant power of 27 MW dedicated to the building
- 42 MW non-redundant available in addition
- PUE 1.6 in 2010 before the rework
- Replacement of UPS system and power distribution panels
- Introduction of oil immersion techniques
- GCR Carnojet system, 4x 46U tanks capable of dissipating up to 45 kW
- Oil temperature is 50C max
- Expect to achieve PUE of 1.05 to 1.1

IT Facilities & Business Continuity (6)

Energy services performance contracting - Michael Ross

- The talk covered contracts concluded by U.S. federal agencies
- The aim of these contracts is to ensure efficient use of U.S. federal data centers and reduce their energy footprint
- The basic idea is that infrastructure measures to reduce the energy footprint are paid for by the energy savings achieved over up to 25 years
- Once paid off, the energy savings benefit entirely the organisation running the data centre


Security & Networking (1)
News from the HEPiX IPv6 working group - David Kelsey
- North America ran out of IPv4 addresses on 24/09/2015; IPv6 now reaches 9% of global traffic
- The working group has been running testbeds with GridFTP, dCache, FTS3 and XRootD
- Various problems are encountered every now and then
- While testing is ongoing, IPv6 is not the highest priority for the experiments
- However, the aim is to gradually move to dual-stack services
- Most importantly, IPv4 services must continue to work!
- Especially since a number of issues were identified in dual-stack systems
- PerfSONAR is used to monitor IPv6 traffic
- The aim is to obtain the same performance with IPv6 as with IPv4
- Today, CERN and six Tier-1 sites are IPv6 capable
- Security concerns and issues have been reviewed

Security & Networking (2)
Status of the IPv6 OSG software stack tests - Edgar Fajardo Hernandez
- Definition of compliance: both client and server are dual-stack, and the software continues to work correctly if one is IPv4-only
- Tests have been run between FNAL, Wisconsin and UCSD
- 23/44 packages were tagged as fully compliant
- 12/44 packages were tagged as not compliant
- Storage services are overall compliant, except for Hadoop
- File access issues with BeStMan and dCache clients
- Authorisation and authentication issues with VOMS, GUMS, gLExec and EDG mkgridmap
- CEs and job submission work
- HTCondor and GRAM successfully tested
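The compliance definition above (dual-stack on both ends, graceful fallback when one side is IPv4-only) corresponds to the usual client pattern of resolving with AF_UNSPEC and trying every returned address. A minimal sketch; the hostname and port are placeholders:

```python
import socket

def connect_dual_stack(host, port, timeout=5.0):
    """Try every address getaddrinfo returns (typically AAAA before A),
    so an IPv6-capable client still reaches an IPv4-only server."""
    last_err = None
    for family, socktype, proto, _name, addr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError("no addresses for %r" % host)

# conn = connect_dual_stack("gridftp.example.org", 2811)  # placeholder host/port
```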

Security & Networking (3)
Network infrastructure for the CERN data centre - Eric Sallaz
- Goals are to support 10 GbE, 40 GbE and 100 GbE network infrastructure
- Mostly optical network compatible with LC connectors, as well as copper
- Low-bandwidth devices still use 1 GbE copper networks
- Even more than before, the "inspect and test before you connect" practice is enforced
- Issues spotted with dirty fibers, damaged connectors, etc.
- Evolving the aging (1995) copper infrastructure is based on standard methods
- For high-speed networks (>10 GbE) new types of cables have to be considered: OM3 for lengths up to 100 m, OM4 for lengths up to 150 m
- With different cables come different types of connectors: MPO-12, MPO-24, etc.
- Coping with heterogeneity is mandatory!

Security & Networking (4)
WLCG Network and transfer metrics WG after one year - Shawn McKee
- Working group created a year ago, progress achieved via monthly meetings
- Use cases have been translated into measurable metrics and reported at CHEP
- PerfSONAR is the main monitoring tool; it allows differentiating between network and application issues
- Data is made widely available via various mechanisms, in collaboration with OSG and ESnet
- A concrete example, targeting FTS transfers, showed asymmetries between ATLAS and CMS
- An observation is that the infrastructure is fine-tuned for long flows and big files; is this the most appropriate choice over time?
- A proposal suggests implementing an (automated) unit to intervene on network issues
- For the time being, the collaboration must continue to collect more data from the metrics

Security & Networking (5)
Update on WLCG/OSG PerfSONAR infrastructure - Shawn McKee
- Goals are to identify network issues and characterize network use
- Current PerfSONAR deployment is composed of 278 instances, 245 of them active
- Metrics provided by PerfSONAR allow faster and more precise analysis
- Standard metrics include packet loss, delays, bandwidth measurements, etc.
- Data collected from the PerfSONAR infrastructure is earmarked for long-term storage
- Path analysis is supported in PerfSONAR 3.5, now used by most sites
- Deployment of the software still requires physical hardware, but powerful VMs are being evaluated, which would reduce the cost of the infrastructure

Security & Networking (6)
Using VPLS for VM mobility - Carles Kishimoto Bisbe
- The CERN data centres have 1300+ racks spread over Meyrin and Wigner
- Both sites are interconnected via 100 GbE links; the network is routed and no VLANs are configured
- The need to move VMs is real, mostly triggered by the decommissioning of aging hypervisors
- The difficulty resides in migrating VMs transparently using the existing network infrastructure
- A solution based on VPLS (Virtual Private LAN Service) at the router level has been designed
- It requires router configuration, but also some additional cabling to ensure proper routing
- It is achieved by putting loops in place on the routers, allowing traffic to pass on the proper path
- Testing is successful and workflows are being put in place to go to production

Security & Networking (7)
Computer security update - Liviu Valsan
- No major evolution since HEPiX Spring 2015, except that exploit kits are ever more advanced
- Mobile devices are increasingly targeted
- Crypto-lockers are being used more frequently, generating large profits for attackers
- Most attacks target widely deployed software like IE and Flash plugins
- Unfortunately, large phishing campaigns are still ongoing and sometimes partly successful
- Software protections are not always enough to prevent issues, mostly due to delays in integrating patches against malicious code
- A thorough example was presented covering poorly designed commercial software and the weak recommendations on how to use it (from the company selling it)
- The usual recommendations to prevent security breaches were recalled

Security & Networking (8)
Building a large-scale security operations center - Liviu Valsan
- Designing a centralized and unified platform for ingestion, storage, analytics and multiple data access
- Scale-out integration within the CERN IT ecosystem, using bare-metal hardware but also OpenStack VMs
- Several components being successfully used were presented:
  - Bro: network analysis framework
  - MISP: Malware Information Sharing Platform
  - CIF: Collective Intelligence Framework
- Collected data is stored and processed in Hadoop; about 500 GB/day is generated
- Services are not necessarily the main targets, but admins and users are
- The talk ended with the analysis of a phishing e-mail sent to HEPiX participants


HEPiX Board News
- New website finally on-line (hosted at DESY)

Next meetings

- Spring 2016: DESY Zeuthen (DE), April 18-22, 2016

- Fall 2016: LBNL Berkeley (US), Oct 17-21, 2016 (back-to-back with CHEP, which is the week before)

- Spring 2017: firm proposal for a European meeting

- Discussions about swapping the European/US location cycle and considering Asia

