CERN IT Department, CH-1211 Genève 23, Switzerland. The Agile Infrastructure Project, Part 1: Configuration Management. Tim Bell, Gavin McCance

Slide 1 The Agile Infrastructure Project, Part 1: Configuration Management
Tim Bell, Gavin McCance
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Slide 2 Configuration and Operations Tools
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
https://agileinf.cern.ch/jira/
IT Technical Forum, 27 Jan 2012

Slide 3 Project scope
The project is reviewing the entire CERN computer-centre management toolset
What happens from the bare metal up
Asset management, inventory
Sysadmin tools and maintenance workflows
Service management and configuration tools
Dynamic configuration for virtual hosts
Operations monitoring
Workflow automation and continuous deployment

Slide 4 Configuration and Operations Tools

Slide 5 Why?
The current production system, built around the Quattor toolset, is successfully managing 10k servers (CERN)
Quattor + many CERN components
Why are we changing the toolset?

Slide 6 What are the issues
Incompressible technical debt
The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources
Small community (less funding) and general support problem.
At CERN, we've fallen into the "sticky hands" support model
We need better automation and integration between the sub-components
Lack of automated workflow: everything is a ticket or an email
Script: your added value in the process is often your CERN password
The 15-min CDB commit walk: context-switch cost

Slide 7 What are the issues
Transferable skills and training
Learning curve for our tools is steep and remains high
It's easier to hire people who have skills in a widely used tool than in your internal tools
Depending on where you look

Slide 8 Job adverts: indeed.com
Index of millions of worldwide job posts across thousands of job sites
These are the sort of posts our departing staff will be applying for.
[Chart: job adverts mentioning Puppet and Quattor]

Slide 9 Integration is hard
IPv6, virtualisation, Windows Server all need a solution
We could leverage lots of open source tools
But piecemeal integration of these requires high investment due to our complex system
Years of organic growth have made the system way too hairy
It's often easier to reinvent rather than integrate
Lack of dynamic-ness in the infrastructure
We hack the config system for dynamic VMs
It's critical to look at the system as a whole

Slide 10 Where to look?
Large ops community out there taking the "tool chain" approach, whose scaling needs match ours: O(100k) servers, many apps
Become standard and join this community

Slide 11 Use Puppet for the core
The tool space has exploded in the last few years
In configuration management and ops
Large, shared tool forges, and lots of experience
Puppet and Chef are the clear leaders for the core tool; other tools in our scope try to integrate with those
Many large-scale enterprises use Puppet
Its declarative approach fits better with what we're used to
Large installations: friendly, wide-base community and commercial support and training
You can buy books on it

Slide 12 Scaling challenges: nodes
Currently we have O(10k) physical nodes
IaaS approach: moving to virtual machines
More (smaller, load-balanced) service nodes
VMs for raw compute (batch or pilot jobs)
Homogeneous: compute + storage on the same node
Add another computer centre and 24/48 SMT cores per node, and you get 100k to 300k virtual nodes to be managed
A 99.6% (1) node update success rate leaves 0.4% of 300k nodes, i.e. 1200 manual interventions, to fix
(1) in a recent intervention on lxbatch

Slide 13 Scaling challenges: people
Many, diverse applications ("clusters") managed by different teams
... and 700+ other "unmanaged" Linux nodes in VMs that could benefit from a simple configuration system

Slide 14 Agile Infrastructure "1st Try"
First started investigating tools in September using part-time resources from CF, DB, DSS, GT, OIS and PES
Trying iterative agile-sprint style (Scrum): short sprints, feedback, sprint review, visible
Take first, best-guess at architecture and tool selection, iterate
Mixed success with this agile style
What works: Good visibility and reviews. Daily scrum meeting useful. Weekly review meeting open to management.
What doesn't: The time-boxing part of Scrum sprints is hard with part-time resources
The project planning now foresees more dedication of staff

Slide 15 Agile Infrastructure "1st Try"
We're currently running:
OpenStack as cloud software for virtual machines, image management, bulk storage (future IT forum presentation)
Puppet for the configuration management core, with Foreman as a dashboard

Slide 16 Foreman dashboard

Slide 17 Agile Infrastructure "1st Try"
We're currently running:
OpenStack as cloud software for virtual machines, image management, bulk storage (future IT forum presentation)
Puppet for the configuration management core, with Foreman as a dashboard
None of the tools are perfect out of the box
... but we'd rather submit patches to a good open source tool than re-implement it
We've experienced very good community support: RFCs and patches are quickly accepted
Very active community: often problems are fixed and missing features implemented before you even report them

Slide 18 Agile Infrastructure "1st Try"
We're currently running:
yum for software distribution (replacing spma)
git for template management: why git?
Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
Many of the tools we can benefit from also assume git
We should not be different from the rest of the community

Slide 19 Puppet
Client/server architecture
puppetmaster: horizontally scalable Rails application
X509 cert authenticated nodes: integrate with CERN CA

Slide 20 Puppet
Puppet runs on the client, applying the configuration changes
It detects the current state and only runs if there's something to do
It runs every few minutes, so new configuration will be ~immediately applied (fail-fast).
This is a change from CDB, where latent changes can be stacked up
Normal mode is client-side compile (assume success)
No more CDB commit waits
Change from CDB: the compilation fails later
Good monitoring is a pre-req: puppet sends reports back to the puppetmaster
The Foreman tool can collect these for you

Slide 21 Puppet language
Puppet uses its own Ruby-like language for the templates to assert the desired state of the nodes
With Ruby fall-back for hard stuff (we've only needed this once)
Being declarative rather than procedural, there are quirks
Takes a bit of practice to get it
There are books, online docs, online cook-books, and a large community to help
It dispenses with the need for ncm components
All the work is done by puppet on the node itself: you just provide the template part to assert what you want done
Less software -> easier to move to new OS versions

Slide 22 Externals
Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
Node function + hardware
Moving a host between clusters is a DB update
Your configuration can use variables the node detects itself
e.g. reconfigure daemons based on where a newly live-migrated VM has found itself
Query the compiled configuration of other hosts
e.g. "Open my firewall to the lxadm nodes"

Slide 23 Moving towards PaaS
Parametrisable recipes
Just fill in the blanks
The aim is to make it easy to use pre-canned recipes without even touching a Puppet template
e.g.
"stick a standard CERN SSO-enabled Apache / mod_wsgi / Django server on my box with these parameters"
Moving us in the PaaS direction
Ultimately, it would be better if you never even needed to log into this node (J2EE public service, IT web hosting service, MySQL service)

Slide 24 Standard workflow
[Diagram: three workflows side by side]
CDB on lxadm: check out from CDB; update templates; CDB commit; run and check on test node; notify with nc-client; n minutes; iterate
Puppet on lxadm: check out from git; update templates; git commit and push; run and check on test node; notify with mcollective; 1 minute; iterate
Puppet-apply on test node: check out from git on the test node; update templates; run puppet-apply; check on test node; notify with mcollective; iterate
Remaining diagram labels: check on foreman; check on node(s); check on foreman; git commit and push

Slide 25 Modernising our processes
Our software processes for the computer centre are fairly limited
Fire-and-forget broadcasts to project-elfms, and rather manual
The manual test/ -> preprod/ -> prod/ template dance
Our toolset RPMs are built on a laptop and uploaded to swrep by hand
Add standard CI (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
... then automate the testing
e.g. suitably tagged RPMs are automatically deployed to /test nodes.
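
The idea of deploying "suitably tagged" RPMs automatically can be sketched as a small planning function. This is only an illustration under stated assumptions: the `Rpm` record, the `plan_deployment` helper, and the use of the stage names test, preprod and prod as build tags are hypothetical and are not taken from Koji, swrep or any CERN tool.

```python
from dataclasses import dataclass

# Stage names follow the test/ -> preprod/ -> prod/ sequence on the slide.
# Using them directly as package tags is an assumption of this sketch.
STAGES = ("test", "preprod", "prod")


@dataclass(frozen=True)
class Rpm:
    """Hypothetical record for a package produced by the build system."""
    name: str
    version: str
    tag: str


def plan_deployment(rpms, node_stages):
    """Map each node to the packages whose tag matches the node's stage.

    node_stages is a dict of node name -> stage name. Packages carrying
    an unknown tag are never scheduled automatically.
    """
    plan = {node: [] for node in node_stages}
    for rpm in rpms:
        if rpm.tag not in STAGES:
            continue  # unknown tag: leave for a human to look at
        for node, stage in node_stages.items():
            if stage == rpm.tag:
                plan[node].append(f"{rpm.name}-{rpm.version}")
    return plan
```

With such a mapping, advancing a package from one stage to the next becomes a change of tag rather than a manual upload.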

Slide 26 Modernising our processes
We're working out which of the many puppet / git models suits us
Code review, sign-off and automated notification for changes that will affect multiple clusters
How to automate the test/preprod/prod advancement
Pre-req is flexible monitoring and alarming: you need to trust that an automation failure will be signaled to you
Script-generated emails are banned
Need good monitoring to hang these notifications on
Integrate components rather than use email
Script-generated tickets (where your value in the process is your password) are banned

Slide 27 Current tool snapshot (liable to change)
[Diagram] Tools shown: Jenkins; Koji, Mock; Puppet; Foreman; AIMS/PXE; Yum repo, Pulp; Puppet stored config DB; mcollective, yum; JIRA; Lemon; git, SVN; OpenStack Nova; Hardware database

Slide 28 Preliminary timelines
Year | What | Actions
2011 | | Agree overall principles
2012 | | Prepare formal project plan; Establish IaaS in CERN CC; Production Agile Infrastructure; Monitoring Implementation as per WG; Migrate lxcloud; Early adopters to Agile Infrastructure
2013 | LSD 1; New Data Centre | Extend IaaS to remote CC; Business Continuity; Support Experiment App re-work; Migrate CVI; General migration to Agile with SLC6 and Windows 8
2014 | LSD 1 (to November) | Phase out Quattor/CDB/
Aggressive schedule if we are to make it for the new data centre

Slide 29 Initial steps
Decide on tools now and integrate them together to make a production setup (Q1)
We can still change... but we're starting to commit
Looking for early adopters (from Q1)
In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
e.g.
PES/OIS services: batch/VMs, JIRA, Drupal
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012
Help with integration / coding
Help with ideas
Help with building the task list

Slide 30 Summary
IT has started a new project to move our infrastructure to a new toolset based around industry-standard open source components
Puppet for the core configuration tool
Better integration between components
Use of more modern software processes to aid deployment
Better monitoring
Engage with the community rather than re-implement
Overall project scope is wider (future IT forums)
Cloud and virtualisation, improved monitoring
Please get involved early and give feedback
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure

Slide 31 Backup slides

Slide 32 Code ownership model
The "sticky hands" support model (you touched it last!)
We're working out an FE-based model where:
Code is owned by the related service Functional Element
Ownership confers the responsibility to maintain a decent standard config for the computer centre, and the responsibility to roll out new versions of that code/config
Patches from interested people can be offered, and if you take them, you support them, not the guy that gave you the patch

Slide 33 mcollective and messaging
mcollective is a notification framework
Mix of CERN's not.d / wassh
It broadcasts instructions to run pre-canned tasks to nodes selected by a filter, collects the results from the nodes, then renders that result for the CLI
e.g. "restart all my webservers", "do a puppet run now"
It requires a messaging framework that all nodes subscribe to (to receive the notification)
Typically: ActiveMQ or RabbitMQ
Both OpenStack and our (future) monitoring system need a CC-wide messaging system as well
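
The filter, broadcast, collect and render cycle described for mcollective can be illustrated with a small in-process simulation. This is a sketch under stated assumptions, not the mcollective API: there is no message broker here, and the `Node` class, the `TASKS` table, and the `broadcast` and `render` helpers are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a subscribed node and the facts it reports."""
    name: str
    facts: dict = field(default_factory=dict)


# Pre-canned tasks: nodes only run what is registered here (names are made up).
TASKS = {
    "restart_webserver": lambda node: "restarted",
    "puppet_run": lambda node: "puppet run triggered",
}


def broadcast(nodes, task, fact_filter):
    """Run a pre-canned task on every node whose facts match the filter.

    Returns a dict of node name -> result, mimicking the collection step.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    selected = [
        node for node in nodes
        if all(node.facts.get(key) == value for key, value in fact_filter.items())
    ]
    return {node.name: TASKS[task](node) for node in selected}


def render(results):
    """Format the collected results, one line per node, for a CLI."""
    return "\n".join(f"{name}: {status}" for name, status in sorted(results.items()))
```

In a real deployment the selection and the replies would travel over the shared messaging layer (ActiveMQ or RabbitMQ) rather than a Python list, which is why every node has to subscribe to it.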
