Understanding Mass and Agility OSCON 2014, Portland 23/07/2014


Understanding Mass and Agility

OSCON 2014, Portland, 23/07/2014

Tim Bell, @noggin143

[email protected]

23/07/2014 2OSCON - CERN Mass and Agility


About Tim

• Runs IT Infrastructure group at CERN
• Member of OpenStack management board and user committee
• Previously worked at
  • Deutsche Bank running European Private Banking Infrastructure
  • IBM as a consultant and kernel developer


CERN was founded in 1954: 12 European States, “Science for Peace”

Today: 21 Member States

• Member States: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Israel, Italy, the Netherlands, Norway, Poland, Portugal, Slovakia, Spain, Sweden, Switzerland and the United Kingdom
• Candidate for Accession: Romania
• Associate Members in Pre-Stage to Membership: Serbia
• Applicant States for Membership or Associate Membership: Brazil, Cyprus (awaiting ratification), Pakistan, Russia, Slovenia, Turkey, Ukraine
• Observers to Council: India, Japan, Russia, Turkey, United States of America; European Commission and UNESCO

• ~2,300 staff
• ~1,000 other paid personnel
• >11,000 users
• Budget (2013): ~1,000 MCHF


What are the Origins of Mass?


Matter/Antimatter Symmetric?


Where is 95% of the Universe?


Collisions


A Big Data Challenge

In 2014,
• ~100PB archive with an additional 35PB/year
• ~11,000 servers
• ~75,000 disk drives
• ~45,000 tapes
• Data should be kept for at least 20 years

In 2015, we start the accelerator again
• Upgrade to double the energy of the beams
• Expect a significant increase in data rate
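As a rough illustration of the archive figures above, the sketch below projects the archive size forward. It assumes the growth rate stays at 35PB/year, which is an assumption made here and not a claim from the slides; since the data rate is expected to rise after the 2015 restart, the result is best read as a lower bound. The function name `projected_archive_pb` is hypothetical.

```python
# Lower-bound projection of the archive size.
# ASSUMPTION: growth stays constant at 35PB/year; the slides expect it to rise.
ARCHIVE_2014_PB = 100     # ~100PB archive in 2014 (from the slide)
GROWTH_PB_PER_YEAR = 35   # additional 35PB/year (from the slide)


def projected_archive_pb(year: int) -> int:
    """Return the archive size in PB for a given year, assuming constant growth."""
    if year < 2014:
        raise ValueError("projection starts in 2014")
    return ARCHIVE_2014_PB + GROWTH_PB_PER_YEAR * (year - 2014)


for year in (2014, 2015, 2018, 2023):
    print(year, projected_archive_pb(year), "PB")
```

Even under this conservative assumption the archive roughly quadruples within a decade, which is the background for the fixed-staff concerns raised later in the talk.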


LHC data growth

• Plan to record 400PB/year by 2023
• Compute needs expected to be around 50x current levels if budget available

[Chart: PB per year, plotted for 2010, 2015, 2018 and 2023]


Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution

Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis

Tier-2 (~200 centres):
• Simulation
• End-user analysis

• Data is recorded at CERN and Tier-1s and analysed in the Worldwide LHC Computing Grid
• In a normal day, the grid provides 100,000 CPU days executing over 2 million jobs
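The daily grid figures imply an average job size, which the short calculation below works out. It is only a sketch using the two numbers quoted on the slide; because the slide says "over 2 million jobs", the result is an upper bound on the average.

```python
# Average CPU time per grid job, derived from the two figures on the slide.
CPU_DAYS_PER_DAY = 100_000   # CPU days delivered by the grid in a normal day
JOBS_PER_DAY = 2_000_000     # "over 2 million jobs", so this is a minimum

avg_cpu_hours_per_job = CPU_DAYS_PER_DAY * 24 / JOBS_PER_DAY
print(f"At most {avg_cpu_hours_per_job:.1f} CPU hours per job on average")
```

Delivering 100,000 CPU days in one day also means that roughly 100,000 cores are busy around the clock.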


The CERN Meyrin Data Centre


New Data Centre in Budapest


Good News, Bad News

• Additional data centre in Budapest now online
• Increasing use of facilities as data rates increase

But…
• Staff numbers are fixed, no more people
• Materials budget decreasing, no more money
• Legacy tools are high maintenance and brittle
• User expectations are for fast self-service


Public Procurement Cycle

Step                                 Time (Days)   Elapsed (Days)
User expresses requirement                                      0
Market Survey prepared                        15               15
Market Survey for possible vendors            30               45
Specifications prepared                       15               60
Vendor responses                              30               90
Test systems evaluated                        30              120
Offers adjudicated                            10              130
Finance committee                             30              160
Hardware delivered                            90              250
Burn in and acceptance                        30              280
  (30 days typical, 380 worst case)

Total: 280+ Days
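The "Elapsed" column is simply the running total of the "Time" column, which can be checked with a few lines. This is a sketch that copies the step durations from the table; the worst-case total at the end is derived here by swapping the typical 30-day burn-in for the 380-day worst case and does not appear on the slide.

```python
from itertools import accumulate

# (step, duration in days), copied from the procurement table.
STEPS = [
    ("User expresses requirement", 0),
    ("Market Survey prepared", 15),
    ("Market Survey for possible vendors", 30),
    ("Specifications prepared", 15),
    ("Vendor responses", 30),
    ("Test systems evaluated", 30),
    ("Offers adjudicated", 10),
    ("Finance committee", 30),
    ("Hardware delivered", 90),
    ("Burn in and acceptance", 30),  # typical case
]
WORST_CASE_BURN_IN_DAYS = 380

# Running total of the durations reproduces the "Elapsed" column.
elapsed = list(accumulate(days for _, days in STEPS))
for (name, days), total in zip(STEPS, elapsed):
    print(f"{name:<36}{days:>4}{total:>6}")

# Derived here, not on the slide: replace the typical burn-in with the worst case.
worst_case_total = elapsed[-1] - STEPS[-1][1] + WORST_CASE_BURN_IN_DAYS
print("Typical total:", elapsed[-1], "days; worst case:", worst_case_total, "days")
```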


Approach

• There is no Moore’s Law for people
• Automation needs APIs, not documented procedures
• Focus on high people effort activities
• Are those requirements really justified?
• Accumulating technical debt stifles agility
• Find open source communities and contribute
• Understand ethos and architecture
• Stay mainstream


O’Reilly Consideration


Indeed.Com Consideration


[Diagram of infrastructure tools: Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo; Pulp; Puppet-DB; mcollective, yum; JIRA; Lemon / Hadoop / LogStash / Kibana; git; OpenStack Nova; Hardware database; Puppet; Active Directory / LDAP]


Puppet Configuration

• Over 10,000 hosts in Puppet
• 160 different hostgroups
• Tool chain using
  • PuppetDB
  • Foreman
  • Git
• Scaling issues resolved with the communities


Monitoring - Flume, Elastic Search, Kibana

[Diagram components: OpenStack infrastructure, Flume gateway, HDFS, Elasticsearch, Kibana]


[Diagram of OpenStack components and CERN integrations: Horizon; Keystone; Glance; Nova; Network; Compute; Scheduler; Cinder; Ceilometer; Block Storage (Ceph & NetApp); Microsoft Active Directory; CERN DB on Demand; CERN Network Database; Account mgmt system; CERN Accounting]


Scaling Architecture Overview

[Diagram: Load Balancer (Geneva, Switzerland); Top Cell controllers (Geneva, Switzerland); Child Cell (Geneva, Switzerland) with controllers and compute-nodes; Child Cell (Budapest, Hungary) with controllers and compute-nodes]


Status

• Multi-data centre cloud in production since July 2013 (Geneva and Budapest) with nearly 1,000 users
• Currently running OpenStack Havana
• KVM and Hyper-V deployed
• All configured automatically with Puppet
• ~70,000 cores on ~3,000 servers
• 3PB Ceph pool available for volumes, images and other physics storage


The Agile Experience


Cultural Barriers


Agility and Elasticity Limits

• Communities help to set good behaviour
• Internal demonstrations build momentum
• Finding the right speed is key
• Keeping up with releases takes focus
• Coping with legacy requires compromise
• Travel budget needs significant increase!


Next Steps: Scale with Physics

• Scaling to >100,000 cores by 2015
• Around 100 hypervisors per week with fixed staff
• Deploying and configuring latest releases
• Need to stay close … but not too close
• Legacy systems retirement
• Server consolidation
• Home grown configuration and monitoring
• Analytics of processor, disk and network
• Focus on efficiency
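To give a feel for the scaling target, the sketch below estimates how many hypervisors and weeks it would take to grow from the ~70,000 cores on ~3,000 servers quoted in the Status slide to 100,000 cores. It assumes new hypervisors have the same average core count as the existing fleet and that the full 100 per week go towards growth; both assumptions are made here for illustration and do not come from the slides.

```python
import math

CURRENT_CORES = 70_000        # from the Status slide
CURRENT_SERVERS = 3_000       # from the Status slide
TARGET_CORES = 100_000        # ">100,000 cores by 2015"
HYPERVISORS_PER_WEEK = 100    # "around 100 hypervisors per week"

# ASSUMPTION: new hypervisors match today's average core density.
cores_per_server = CURRENT_CORES / CURRENT_SERVERS
extra_servers = math.ceil((TARGET_CORES - CURRENT_CORES) / cores_per_server)

# ASSUMPTION: every installed hypervisor adds capacity (none replace old ones).
weeks_needed = math.ceil(extra_servers / HYPERVISORS_PER_WEEK)

print(f"~{cores_per_server:.1f} cores per server")
print(f"{extra_servers} extra servers, about {weeks_needed} weeks of installs")
```

In practice denser new hardware and the retirement of legacy servers would shift these numbers, so the output is only an order-of-magnitude check.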


Next Steps: Federated Clouds

[Diagram of federated clouds: CERN Private Cloud (70K cores); ATLAS Trigger (28K cores); CMS Trigger (12K cores); IN2P3 Lyon; Brookhaven National Labs; NecTAR Australia; Public Cloud such as Rackspace; Many Others on Their Way]


Summary

• Open source tools have successfully replaced CERN’s legacy fabric management system
• Scaling to 100,000s of cores with OpenStack and Puppet is in sight
• Cultural change to an Agile approach has required time and patience but is paying off

Community collaboration needed to reach 400PB/year


Backup Slides


http://www.eucalyptus.com/blog/2013/04/02/cy13-q1-community-analysis-%E2%80%94-openstack-vs-opennebula-vs-eucalyptus-vs-cloudstack


Monitoring - Kibana


Architecture Components

[Diagram: components by node type]

Top Cell controller:
• rabbitmq
• Keystone
• Nova api, Nova consoleauth, Nova novncproxy, Nova cells
• Horizon
• Glance api, Glance registry
• Ceilometer api
• Cinder api, Cinder volume, Cinder scheduler
• Flume

Children Cells controller:
• rabbitmq
• Keystone
• Nova api, Nova conductor, Nova scheduler, Nova network, Nova cells
• Glance api
• Ceilometer agent-central, Ceilometer collector
• Flume

Compute node:
• Nova compute
• Ceilometer agent-compute
• Flume

Other components shown: HDFS, Elastic Search, Kibana, MySQL, MongoDB, Stacktach, Ceph


Upgrade Strategy

• Surely “OpenStack can’t be upgraded”
• Our Essex, Folsom and Grizzly clouds were ‘tear-down’ migrations
• Puppet managed VMs are typical Cattle cases – re-create
• User VMs: snapshot, download image and upload to new instance
• One month window to migrate
• Users of production services expect more
• Physicists accept not creating/changing VMs for a short period
• Running VMs must not be affected


Phased Migration

• Migrated by Component
  • Choose an approach (online with load balancer, offline)
  • Spin up ‘teststack’ instance with production software
  • Clone production databases to test environment
  • Run through upgrade process
  • Validate existing functions, Puppet configuration and monitoring
• Order by complexity and need
  • Ceilometer, Glance, Keystone
  • Cinder, Client CLIs, Horizon
  • Nova
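The phased approach above amounts to running the same checklist for each component, in a fixed order. The sketch below only generates that plan as data; it performs no upgrade, and the names `STEPS`, `PHASES` and `upgrade_plan` are hypothetical and not part of any CERN or OpenStack tooling.

```python
# Per-component checklist, taken from the "Migrated by Component" bullets.
STEPS = [
    "Choose an approach (online with load balancer, or offline)",
    "Spin up 'teststack' instance with production software",
    "Clone production databases to test environment",
    "Run through upgrade process",
    "Validate existing functions, Puppet configuration and monitoring",
]

# Component order, taken from the "Order by complexity and need" bullets.
PHASES = [
    ["Ceilometer", "Glance", "Keystone"],
    ["Cinder", "Client CLIs", "Horizon"],
    ["Nova"],
]


def upgrade_plan():
    """Yield (phase number, component, step) in the order the slide describes."""
    for phase_number, components in enumerate(PHASES, start=1):
        for component in components:
            for step in STEPS:
                yield phase_number, component, step


plan = list(upgrade_plan())
for phase_number, component, step in plan:
    print(f"Phase {phase_number} | {component:<12} | {step}")
```

Keeping the plan as plain data makes it easy to review the ordering before any component is touched; Nova, the most complex component, comes last.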


Upgrade Experience

• No significant outage of the cloud
  • During upgrade window, creation not possible
  • Small incidents (see blog for details)
  • Puppet can be enthusiastic! - we told it to be
• Community response has been great
  • Bugs fixed and points are in Juno design summit
  • Rolling upgrades in Icehouse will make it easier


Duplication and Divergence

[Diagram comparing two models. Service Silos: Windows, Web, Database, Custom, Compute, Storage, Network, Hardware, Facilities. Functional Layers: Platform as a Service, Infrastructure as a Service, Windows, Compute, Storage, Network, Hardware, Facilities.]


Service Models

• Pets are given names like pussinboots.cern.ch
  • They are unique, lovingly hand raised and cared for
  • When they get ill, you nurse them back to health

• Cattle are given numbers like vm0042.cern.ch
  • They are almost identical to other cattle
  • When they get ill, you get another one
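The naming convention above can be turned into a simple heuristic for telling the two service models apart. This is a minimal sketch, assuming cattle hostnames always take the form `vm` followed by digits as in the slide's example; the pattern and the `service_model` function are illustrative and not an actual CERN rule.

```python
import re

# ASSUMPTION: cattle hostnames look like "vm" + digits, as in vm0042.cern.ch.
CATTLE_PATTERN = re.compile(r"^vm\d+\.")


def service_model(hostname: str) -> str:
    """Classify a host as 'cattle' (numbered) or 'pet' (named)."""
    return "cattle" if CATTLE_PATTERN.match(hostname) else "pet"


for host in ("pussinboots.cern.ch", "vm0042.cern.ch"):
    print(host, "->", service_model(host))
```

A classification like this could feed a policy decision such as "re-create on failure" for cattle versus "repair in place" for pets, mirroring the two bullets.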
