HEPiX Spring Meeting 2015, University of Oxford, UK

http://indico.cern.ch/event/346931/

Arne Wiebalck

Julien Leduc

Adam Krajewski

Wiebalck, Leduc, Krajewski: HEPiX Spring 2015 Summary


HEPiX

• Global organization of service managers and support staff providing computing facilities for the HEP community

• Participating sites include BNL, CERN, DESY, FNAL, IN2P3, INFN, NIKHEF, RAL, TRIUMF …

• Meetings are held twice per year

- Spring: Europe, Autumn: U.S./Asia

• Reports on status and recent work, work in progress & future plans

- Usually no showing-off, honest exchange of experiences


Outline

• 2015 Spring Meeting & General HEPiX News

• Site Reports (17)

• Grids, Clouds, and Virtualization (8)

• Storage and File systems (8)

• Computing and Batch (17)

• IT Facilities (2)

• End User Services & Operating Systems (10)

• Networking and Security (10)

• Basic IT Services (7)

• Closing remarks

Presenters: Arne, Julien, Adam


HEPiX Spring 2015

• Mar 23 – 27, 2015 at the Physics Department, Oxford University, UK

• 134 registered participants (record!)

- Many first timers again

- 75% from Europe, ~20 from 8 companies

- 45 different affiliations

• 83 contributions (+30%)

- Slots cut down to 25 mins

- Ceph BoF, IPv6 tutorial


HEPiX Working Groups

• Benchmarking

- Awaiting SPEC CPUv6

- Suggestion of a “fast” benchmark (minutes)

- First test of a candidate provided by LHCb


Site Reports (1)

• 17 site reports: about half from T0/T1

• HTCondor continues to be very visible

- Many sites are considering a move (e.g. DESY or KISTI)

- Mostly due to scalability issues with current solutions

- Feedback from sites running it is very positive

- INFN renewed LSF contract “for the last time”

• Config’ mgmt: Puppet still gaining popularity

- Quattor flag held up by some (few) sites

- Ansible mentioned as well (NERSC) … (reminds me of Umeå 2009)


Site Reports (2)

• Storage: Ceph clearly dominating reports …

- Some sites well advanced (e.g. BNL, RAL, CERN)

- Many sites exploring what to do with Ceph

• … but Lustre (re)gains some popularity

- Beyond GSI & JLAB, sites are considering deployment (e.g. NIKHEF)

- Apparently sites see a need for a distributed file system (specialised ones? DESY considers moving out of AFS)

• SL vs. CentOS: not a hot topic

- No rivalry, sites do not worry


Site Reports (3)

• Monitoring: being redone at several sites

- With usual suspects: Flume, ES, Kibana, Grafana, …

• Cgroups started to be used more widely

- Issues on various batch installations (kernel panics)

• IHEP: per user but managed VMs

- no root access, no console access

- an option for lxplus++ ?


GSI’s Cube: “3-d” CC

- 6 floors (128 racks, 36k U)

- PUE < 1.1

- Used for heating

- Cable length

More details next time!


Virtualization (1)

• 8 talks, 2 from CERN

- Bruno: Cloud report & Heat

• OpenStack community within HEP growing

- Different approaches (e.g. IHEP only 3 images, RAL only 1 flavor)

- Mostly used for dev machines, some for services, few for compute (normal virtualization phase-in)

• “ATLAS on Amazon” (BNL/AWS)

- Practical feasibility of commercial clouds for ATLAS production – at full scale!

- Joint work with the AWS Scientific Computing Group

- Areas: compute (capacity?), networking (direct links?), storage (from “keep” to “delete”), … std vs scientific computing

- First test w/ 20k slots was economical, next test: 100k cores

http://indico.cern.ch/event/346931/session/9/contribution/20/material/slides/0.pdf

https://indico.cern.ch/event/346931/session/9/contribution/54/material/slides/1.pdf


Virtualization (2)


Storage and File systems (1)

• 8 talks

- 4 about Ceph

- Panel and BoF: Ask the Ceph experts

• Ceph as a building block for many services

• RACF at BNL

- CephFS in production since 2014Q3

- Awaiting RDMA support to ditch IBoE

• RAL

- Ceph as a large scale object store to replace their Castor disk-only storage

- Using Xrootd and GridFTP plugins

- Reported their experience of losing monitors: now using 3 physical monitors in physically separate locations

- Moving to erasure coding: 3 replicas are too expensive, aiming for about +30% HA overhead for Ceph storage
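The cost argument for erasure coding can be checked with simple arithmetic. The sketch below assumes the usual definitions: n-way replication stores n full copies (overhead of n - 1 times the user data), while a k+m erasure code stores m coding chunks for every k data chunks (overhead of m/k). The 10+3 profile is purely illustrative and is not RAL's stated configuration; the function names are hypothetical.

```python
from fractions import Fraction


def replication_overhead(copies: int) -> Fraction:
    """Extra raw capacity per unit of user data with n-way replication."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return Fraction(copies - 1)


def erasure_overhead(k: int, m: int) -> Fraction:
    """Extra raw capacity per unit of user data for a k+m erasure code."""
    if k < 1 or m < 0:
        raise ValueError("need k >= 1 data chunks and m >= 0 coding chunks")
    return Fraction(m, k)


def raw_capacity(user_data: float, overhead: Fraction) -> float:
    """Raw capacity needed to hold user_data at the given overhead."""
    return user_data * float(1 + overhead)


if __name__ == "__main__":
    # Illustrative profiles only (assumption, not taken from the talk).
    profiles = [
        ("3 replicas", replication_overhead(3)),
        ("EC 10+3", erasure_overhead(10, 3)),
    ]
    for label, overhead in profiles:
        print(f"{label}: +{float(overhead):.0%} overhead, "
              f"{raw_capacity(1.0, overhead):.2f} PB raw per PB stored")
```

Under these assumptions three replicas cost +200% raw capacity, whereas a 10+3 code costs +30% and still survives the loss of any three chunks, which is in line with the target quoted on the slide.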


Storage and File systems (2)

• Distributed file systems:

- GPFS: DESY Petra III data taking and analysis infrastructure is moving to GPFS after detector upgrades (DESY <=> IBM partnership)

- BeeGFS experience: DESY wants to use this as a replacement for GPFS and Lustre

- BeeGFS is the former FhGFS from Fraunhofer, renamed in 2014

- The BeeGFS project will become open source, with commercial support available


Storage and File systems (3)

• DESY experimenting with HGST open Ethernet drive to build a dCache cluster

• Each disk:

- Runs Linux (2GB RAM, disk is sda, network is eth0), 60 disks per 4U enclosure

• They recompiled dCache pool code and run it directly on disks

• Future tests: reuse this HW to test Ceph deployment on disks


Computing and Batch (1)

• 17 talks

- 8 benchmarking + 9 batch systems

• Commissioning cloud resources

- Several simple metrics (wallclock, CPU usage, data stage-in time, cvmfs software setup time) allow quick commissioning of cloud resources

- A stable cloud is easy to integrate in production

- Lots of effort to optimize performance


Computing and Batch (2)

• BNL remote evaluation of HW

- Need to speed up acquisition processes (partnership with vendors)

- Long acquisition processes <=> money lost

• Beyond HS06/fast benchmark

- Candidates (SPEC CPUv6, multithreaded Geant4), mandatory compiler flags (-O2?)...

- LHCb fast benchmark: HS06/LHCb ratio between 1.2 and 1.6 (but can go >2)


Computing and Batch (3)

• Alternate CPUs:

- Intel Atom Avoton, Tegra K1 (ARM 32bit) extensively tested

- ARM 64bit software support is improving

- Working on integration in CERN environment (PXE boot, puppet, koji...)

- Test platforms available through CERN techlab

https://twiki.cern.ch/twik/bin/viewauth/IT/TechLab


Computing and Batch (4)

• Univa GE is popular

- Only one to support the DRMAA2 standard now

• HTCondor is more popular

- Very large reactive community

- Lots of additional tools developed by communities (HEP, HCCondor)

• Monitoring CPU and memory usage with cgroups

- Batch schedulers can isolate jobs in cgroups

- Allows understanding resource utilization per type of job (analysis, reconstruction, ...) => refine scheduling policies
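As a rough illustration of per-job accounting with cgroups, the sketch below reads the cgroup v1 accounting files `cpuacct.usage` (CPU time in nanoseconds), `memory.usage_in_bytes` and `memory.max_usage_in_bytes`. The mount point `/sys/fs/cgroup`, the per-job cgroup path and all function names are assumptions for illustration; the actual layout depends on the batch system and kernel configuration, and none of this is taken from the talks.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class JobUsage:
    cpu_seconds: float       # total CPU time consumed by the job
    memory_bytes: int        # current memory usage
    peak_memory_bytes: int   # highest memory usage recorded so far


def _read_int(path: Path) -> int:
    """Read a cgroup accounting file that holds a single integer."""
    return int(path.read_text().strip())


def read_job_usage(job_cgroup: str,
                   root: Path = Path("/sys/fs/cgroup")) -> JobUsage:
    """Collect CPU and memory accounting for one job's cgroup (v1 layout).

    job_cgroup is a hypothetical relative path such as "batch/job_42".
    """
    cpu_ns = _read_int(root / "cpuacct" / job_cgroup / "cpuacct.usage")
    mem_dir = root / "memory" / job_cgroup
    return JobUsage(
        cpu_seconds=cpu_ns / 1e9,
        memory_bytes=_read_int(mem_dir / "memory.usage_in_bytes"),
        peak_memory_bytes=_read_int(mem_dir / "memory.max_usage_in_bytes"),
    )


def cpu_efficiency(usage: JobUsage, wallclock_seconds: float,
                   cores: int = 1) -> float:
    """CPU time divided by the wallclock time available on the job's cores."""
    if wallclock_seconds <= 0 or cores < 1:
        raise ValueError("wallclock must be positive and cores >= 1")
    return usage.cpu_seconds / (wallclock_seconds * cores)
```

Aggregating such records per job type (analysis, reconstruction, ...) is one way to obtain the utilization picture that scheduling policies could then be tuned against.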


IT Facilities (1)

• 2 talks from CERN

• Recent operational issues at CERN

- 14/10/16 power incident (+ Murphy's law)


IT Facilities (2)

• Another operational incident: dust on tape

- Thanks to the vendor, impact was limited

- Development of a homemade dust sensor to monitor dust inside tape libraries at CERN


End User Services & OS (1)

• 10 talks total, 5 from CERN

- Andreas: CERN Search and Social for the Enterprise Web Experience

- Thomas: Evolutions in the CERN Conferencing Services Landscape

- Arne: CERN CentOS 7 Update

- Nils: Update on software collaboration services at CERN; Status of volunteer computing at CERN

• HEP Software Foundation

- Collaboration started for HEP software/computing efforts (kickoff meeting April 2014, first workshop January 2015)

- Objectives: sharing expertise, catalyzing common SW projects, promoting collaboration in new developments

- Website: http://hepsoftwarefoundation.org

• Scientific Linux Current Status

- Development continues, SL 7.1 released on April 10th 2015

- Researching containerization possibilities: Docker image, Scientific Linux Project Atomic distro


End User Services & OS (2)

• SciDB at NERSC

- Testbed evaluation

- Cluster of ~20 nodes, normally 100 GB – 1 TB data, even 20+ TB

- Happy with the results, decided to go with a production-level cluster

• Lustre at the Sanger Institute

- 11 Lustre volumes, 6 PB storage

- Problems analyzing storage usage

- Solved by implementing an efficient, parallel file tree walker using MPI

• Zimbra at DESY

- Replacement of UNIX mail and Microsoft Exchange
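The idea behind the Sanger Institute's parallel tree walker can be illustrated on a single node. The sketch below is not their MPI implementation: it swaps MPI for a thread pool from the Python standard library and simply hands each top-level subdirectory to a worker, which is a much cruder work split than a real distributed walker would use. All function names are hypothetical.

```python
import os
import stat
from concurrent.futures import ThreadPoolExecutor


def tree_usage(top: str) -> tuple[int, int]:
    """Walk one subtree serially; return (regular file count, total bytes)."""
    files = size = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                files += 1
                size += st.st_size
    return files, size


def parallel_usage(root: str, workers: int = 8) -> dict[str, tuple[int, int]]:
    """Summarise usage per top-level subdirectory, one worker per subtree.

    Regular files sitting directly in root are reported under the key ".".
    """
    subtrees: dict[str, str] = {}
    loose_files = loose_bytes = 0
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subtrees[entry.name] = entry.path
            elif entry.is_file(follow_symlinks=False):
                loose_files += 1
                loose_bytes += entry.stat(follow_symlinks=False).st_size
    result = {".": (loose_files, loose_bytes)}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        usages = pool.map(tree_usage, subtrees.values())
        for name, usage in zip(subtrees, usages):
            result[name] = usage
    return result
```

Threads are adequate here because the work is dominated by metadata calls rather than computation; on a multi-petabyte Lustre installation the same divide-and-merge pattern would have to be spread over many nodes, which is where MPI comes in.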


Networking and Security (1)

• 10 talks in total, 2 from CERN:

- Adam: Effects of packet loss and delay on TCP performance

- Romain: Computer Security Update + phishing demonstration

• IPv6 Working Group

- Lots of sites still not IPv6-ready (especially T2)

- Testing and deploying dual-stack services if performance is sufficient

- Dual-stack perfSONAR should be provided in 2015

• perfSONAR

- Network and Transfer Metric Working Group started in May 2014

- OSG datastore (community data store for all perfSONAR metrics) to enter production in Q3 2015

- Integrating perfSONAR with FTS and experiments to optimize transfers
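To give a feel for why packet loss and delay matter for TCP, the sketch below evaluates the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≈ (MSS / RTT) · C / sqrt(p) with C ≈ sqrt(3/2). This model is brought in here as an illustration and is an assumption: the slide does not say which model or measurements the talk used. The function name and the example values are hypothetical.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(1.5)) -> float:
    """Approximate upper bound on TCP throughput in bits per second.

    mss_bytes: maximum segment size in bytes
    rtt_s:     round-trip time in seconds
    loss_rate: packet loss probability, 0 < loss_rate <= 1
    c:         model constant, about 1.22 in the original approximation
    """
    if mss_bytes <= 0 or rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("need mss > 0, rtt > 0 and 0 < loss_rate <= 1")
    return 8 * mss_bytes * c / (rtt_s * math.sqrt(loss_rate))


if __name__ == "__main__":
    # Example values chosen for illustration only.
    for rtt_ms in (10, 100):
        for loss in (1e-6, 1e-4):
            rate = mathis_throughput_bps(1460, rtt_ms / 1000, loss)
            print(f"RTT {rtt_ms:>3} ms, loss {loss:.0e}: {rate / 1e6:.1f} Mbit/s")
```

In this model throughput falls linearly with round-trip time and with the square root of the loss rate, so a long-distance path tolerates far less loss than a local one for the same transfer rate.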


Networking and Security (2)

• WLCG Cloud Traceability Working Group

- Looking into incident traceability in emerging cloud computing environments

- Best practices for gathering additional logging information in cloud frameworks, configuring VMs etc.

• Operational Security in the EGI and WLCG

- Security policies: reporting vulnerabilities is essential

- Only 8 incidents last year, quite successful prevention

- Now re-working policies to face cloud computing technology threats

• OSSEC at Scotgrid Glasgow

- Visualizing with Elasticsearch / Logstash / Kibana


Basic IT Services (1)

• 7 talks, 3 from CERN:

- Alberto: Configuration management at CERN: Status and directions

- Francisco: Towards a modernisation of CERN’s telephony infrastructure

- Andrei: Updates from Database Services at CERN

• Config Management at RACF

- Deployed Puppet Server in production: catalog compilation avg 1.97 sec -> 1.00 sec

- Looking into Jenkins CI for testing pending production changes

- MCollective in testing, plans to put it in production

• MCollective at DESY

- Successfully deployed in production for the following use cases: steering Puppet agent runs, querying the infrastructure, small parallel-ssh tasks (e.g. package updates)

- Performance problems caused by SSH key plugin, now fixed


Basic IT Services (2)

• Subtlenoise by Lancaster University

- Small framework to leverage acoustics during monitoring shifts

- “produces low-impact but information-rich soundscapes in realtime”

- https://github.com/ptrlv/subtlenoise

• Update on Quattor

- Still in development, Quattor 15.2.0 released March 23rd 2015

- ~15 institutes participating, over 2500 commits on GitHub in 2014

- Active community


HEPiX Board News

• Next meetings

- Autumn 2015: BNL (US) Oct 12 – 16 (to be held jointly with the WLCG GDB)

- Spring 2016: DESY Zeuthen (DE) April 18-22

- Autumn 2016: U.S. West Coast candidates, but also other proposals

• Discussions about swapping the European/US location cycle


Questions?
