OpenStack Operations and Upgrades, Marco Passerini, CSCS, April 2nd, 2019, HPC Advisory Council 2019 Swiss Conference


OpenStack Operations and Upgrades

Marco Passerini, CSCS

April 2nd, 2019

HPC Advisory Council 2019 Swiss Conference


CSCS overview

▪ CSCS is the Swiss National Supercomputing Centre

▪ Unit of the Swiss Federal Institute of Technology in Zurich (ETH Zurich), located in Lugano

▪ CSCS's resources are open to academia, industry and the business sector

▪ 2000 m² machine room with no single supporting pillar or any partitioning

2HPCAC 2019


Cloud computing

▪ Everything as a Service: IaaS, PaaS, SaaS, CaaS, DBaaS, FaaS…

▪ Pay as you go

▪ Fast resource availability, dynamic scaling

▪ No queuing systems!

▪ Capacity planning is important

▪ Modern authentication model: multitenancy, role-based access, SSO

▪ Programmability: REST APIs, SDN, SDS, SDDC…

▪ Private, Public, Hybrid clouds


Use cases for cloud service offering at CSCS

▪ Scientific collaboratories, modern workflows, development

▪ HBP – Human Brain Project

▪ PRACE

▪ CTA – Cherenkov Telescope Array

▪ SDSC – Swiss Data Science Center

▪ MARVEL – National Centre of Competence in Research


OpenStack

"OpenStack is a free and open-source software platform for cloud computing, mostly deployed as infrastructure-as-a-service (IaaS), whereby virtual servers and other resources are made available to customers."

▪ Web-based dashboard, CLI, REST API

▪ 60+ projects

  ▪ Nova, Horizon, Keystone, Cinder, Swift, Neutron, Glance, Manila, Heat, Ironic, Octavia, Designate, Barbican, Mistral, Aodh, Telemetry, Magnum, Sahara, Trove, Monasca, Rally, Tempest, Zaqar, Murano, etc.

▪ ~19 releases since 2010

  ▪ Austin, … , Mitaka, Newton, Ocata, Pike, Queens, Rocky, Stein, etc.

▪ Several deployment tools available:

  ▪ TripleO, Fuel, Bright OpenStack, Kayobe, OpenStack-Ansible, etc.


OpenStack Deployment Timeline


▪ 2016

  ▪ OpenStack feasibility evaluation

▪ 2017

  ▪ January

    ▪ Tambo POC deployment with Red Hat consultant, Newton RHOSP10

  ▪ May

    ▪ Pollux: general-purpose OpenStack

    ▪ Production deployment with Red Hat consultant

    ▪ ~2 weeks

    ▪ Ocata RHOSP11

  ▪ August

    ▪ Pollux test system deployment by CSCS admins

    ▪ Ocata RHOSP11

▪ 2018

  ▪ Spring

    ▪ Mythen deployment: HPC OpenStack


Pollux deployment constraints

▪ Supported deployment method

▪ Possibility of doing major upgrades

▪ Authentication through LDAP+KRB

▪ Federate services with external IdPs

▪ Integration with CSCS storage infrastructure

▪ SAN

▪ GPFS

▪ TSM

▪ Isolated network, with possibility of accessing CSCS services in special cases


Red Hat support timeline


OpenStack Upgrade Timeline


▪ 2018

  ▪ September

    ▪ Pollux upgrade to Pike (RHOSP12): TDS ~2 months, PROD ~2 weeks

  ▪ December

    ▪ Pollux TDS upgrade to Queens (RHOSP13): ~1 month

▪ 2019

  ▪ January

    ▪ Pollux PROD upgrade to Queens (RHOSP13): ~3 weeks


Pollux usage


▪ ~322 VMs

▪ ~214 users

▪ ~42 projects

▪ VM uptime in 2018:

  ▪ 99.93% counting unplanned downtime only

  ▪ 99.76% counting both unplanned and planned downtime
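To make the uptime figures more tangible, the following minimal sketch converts them into hours of downtime. It assumes the percentages are averaged over the whole of 2018 (8760 hours) and that the second figure includes both kinds of outage; the function name is illustrative and not from the talk.

```python
HOURS_IN_2018 = 365 * 24  # 2018 was not a leap year: 8760 hours


def downtime_hours(uptime_percent: float, period_hours: float = HOURS_IN_2018) -> float:
    """Return the hours of downtime implied by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * period_hours


# Figures from the slide; the split into "planned" is derived, not reported.
unplanned = downtime_hours(99.93)   # unplanned downtime only
total = downtime_hours(99.76)       # unplanned and planned downtime
planned = total - unplanned

print(f"unplanned: {unplanned:.1f} h, planned: {planned:.1f} h, total: {total:.1f} h")
# prints "unplanned: 6.1 h, planned: 14.9 h, total: 21.0 h"
```

Under these assumptions, the average VM was unavailable for roughly six hours due to unplanned events and roughly fifteen hours due to planned maintenance over the year.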


Pollux system layout

[Figure: Pollux system layout diagram; surviving labels: "25x", "docker"]

Pollux hardware configuration

Production Hardware:

▪ 1 director

▪ 3 controllers

  ▪ Lenovo 3650 M5

  ▪ CPU: 2x Intel E5-2620 v4 8C

  ▪ RAM: 128 GB

  ▪ NIC: 1x Intel X710 (Dual 40 Gb), 1x IPMI, 1x 1 Gb

▪ 20 compute nodes

  ▪ Lenovo 3650 M5

  ▪ CPU: 2x Intel E5-2660 v4 14C

  ▪ RAM: 512 GB

  ▪ NIC: 1x Intel X710 (Dual 40 Gb), 1x IPMI, 1x 1 Gb

▪ 5 compute nodes (big mem)

  ▪ HP DL360 G9

  ▪ RAM: 768 GB

  ▪ NIC: 1x HP 10Gb (Dual), 1x HP FDR 40Gb, 4x 1Gb

▪ 3 Ceph servers

  ▪ Lenovo 3650 M5

  ▪ CPU: 2x Intel E5-2620 v4 8C

  ▪ RAM: 128 GB

  ▪ NIC: 1x Intel X710 (Dual 40 Gb), 1x IPMI, 1x 1 Gb

  ▪ HDD:

    ▪ 120GB SSD local drives RAID1

    ▪ 18x SATA 2TB drives for data

    ▪ 6x SSD 400GB drives for journaling

▪ 4 Swift nodes on IBM CES Object (GPFS)

Test system Hardware:

▪ 1 director

▪ 3 controllers

▪ 2 compute nodes

▪ 2 compute nodes (big mem)

▪ 3 Ceph servers

▪ 2 Swift nodes on IBM CES Object (GPFS)

▪ few nodes for GPU tests

CPU overprovisioning: 4x

• 224 vcores per node

• 5600 vcores in total

RAM overprovisioning: 1.4x

• no swap

Network bandwidth capped at 1 Gbps per flavor
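The vcore figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: Hyper-Threading is taken to be enabled (two hardware threads per core, which the E5-2660 v4 supports), and all 25 production compute nodes are counted with the same thread count, since the slide total equals 25 times the per-node figure. The variable names are illustrative.

```python
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 14      # Intel E5-2660 v4, from the hardware list
THREADS_PER_CORE = 2       # assumption: Hyper-Threading enabled
CPU_OVERPROVISIONING = 4   # "4x" from the slide
RAM_OVERPROVISIONING = 1.4 # "1,4x" from the slide
RAM_PER_NODE_GB = 512      # standard compute node
COMPUTE_NODES = 25         # 20 standard + 5 big-memory nodes

hardware_threads = SOCKETS_PER_NODE * CORES_PER_SOCKET * THREADS_PER_CORE
vcores_per_node = hardware_threads * CPU_OVERPROVISIONING
total_vcores = vcores_per_node * COMPUTE_NODES
ram_per_node_gb = RAM_PER_NODE_GB * RAM_OVERPROVISIONING

print(f"hardware threads per node: {hardware_threads}")
print(f"vcores per node: {vcores_per_node}")
print(f"vcores in total: {total_vcores}")
print(f"schedulable RAM per standard node: {ram_per_node_gb:.1f} GB")
```

With these inputs the script yields 224 vcores per node and 5600 in total, matching the slide, and about 717 GB of schedulable RAM on a 512 GB node.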


Pollux network layout


▪ ~4000 reserved IPs for Pollux-PROD only

▪ OVS

▪ DVR

  ▪ No bottlenecks, more bandwidth

  ▪ More availability: controllers can go down and VMs still run

[Figure: network layout diagram; networks shown: Tenant, Storage, Provisioning & IPMI, Internal API, Provider VLANs, Floating IP, Storage Mgmt, External + Mgmt, External API]


GPFS CES Object cluster

[Figure: cluster diagram; surviving labels: HDD (Data), Tape (Backup), Flash (Metadata, Swift DBs), SAN; GUI, Monitoring, Mmbackup, HAproxy; OBJ, S3, SwiftOnFile, …; CES; Spectrum Scale; nodes bever01, bever02, bever03, bever04]

• ACLs and roles

  • HBP data curators

• SwiftOnFile

• Cyberduck, ksproxy for SSO

• Backups:

  • Object versioning

  • Disaster recovery to tape

  • Cinder backups to Swift

• S3 Interface


TripleO

▪ TripleO (OpenStack On OpenStack) is a program aimed at installing, upgrading and operating OpenStack clouds using OpenStack's own cloud facilities as the foundations - building on nova, neutron and heat to automate fleet management at datacenter scale.

▪ Red Hat’s deployment tool for OpenStack (RHOSP)


TripleO Architecture


• Director node

• Controller nodes

• Compute nodes

• Ceph nodes


Upgrade flow diagram


Upgrade sequential steps

▪ Minor upgrade (package update)

1. Backup

2. Validation

3. Undercloud upgrade

4. Node image upgrade

5. Template upgrade

6. Overcloud upgrade

▪ Major upgrade (major software and architectural changes)

1. Backup

2. Validation

3. Undercloud upgrade

4. Node image upgrade

5. Ceph migration to Ansible

6. Template upgrade

7. Container registry creation

8. Overcloud upgrade

9. Compute node upgrade
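The strictly sequential nature of these steps can be sketched as a small runner that executes them in order, stops at the first failure, and can later resume from the failed step. This is an illustration only: `run_steps`, `hypothetical_action` and the simulated failure are assumptions made for the example and do not correspond to any TripleO or RHOSP command.

```python
from typing import Callable, List, Optional, Tuple

# Step names copied from the major-upgrade list above.
MAJOR_UPGRADE_STEPS: List[str] = [
    "Backup",
    "Validation",
    "Undercloud upgrade",
    "Node image upgrade",
    "Ceph migration to Ansible",
    "Template upgrade",
    "Container registry creation",
    "Overcloud upgrade",
    "Compute node upgrade",
]


def run_steps(
    steps: List[str],
    action: Callable[[str], bool],
    start_at: int = 0,
) -> Tuple[List[str], Optional[int]]:
    """Run steps in order from start_at and stop at the first failure.

    Returns the steps completed in this run and the index of the
    failed step, or None if every remaining step succeeded.
    """
    completed: List[str] = []
    for index in range(start_at, len(steps)):
        if not action(steps[index]):
            return completed, index
        completed.append(steps[index])
    return completed, None


# Hypothetical failure used only to demonstrate stop-and-resume.
broken_steps = {"Template upgrade"}


def hypothetical_action(step: str) -> bool:
    """Stand-in for the real work of a step; fails if the step is 'broken'."""
    return step not in broken_steps


completed, failed_index = run_steps(MAJOR_UPGRADE_STEPS, hypothetical_action)
print(f"completed {len(completed)} steps, stopped at: {MAJOR_UPGRADE_STEPS[failed_index]}")

broken_steps.clear()  # stands in for fixing the state by hand
remaining, failed_again = run_steps(
    MAJOR_UPGRADE_STEPS, hypothetical_action, start_at=failed_index
)
print(f"resumed and completed {len(remaining)} more steps, failure: {failed_again}")
```

The sketch makes one design point visible: because no step can start before its predecessor succeeds, a failure in the middle blocks everything after it until the state is repaired.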


Main changes introduced by RHOSP12,13

▪ Configuration management

▪ Heat templates are staying

▪ Puppet is now in use together with Ansible

▪ Puppet will be discarded in RHOSP14

▪ Docker Containers to manage the OpenStack services

▪ Kolla, Paunch, Skopeo

▪ RH registry vs. Satellite registry vs. Local registry

▪ Customizations must be adapted to the new workflows

▪ Dockerfiles, Puppet, directory mounts from the hosts


Configuration flow until RHOSP11

Heat templates → Puppet modules → OpenStack configs


Configuration flow after RHOSP12

Heat templates → Puppet/Ansible modules → OpenStack configs for Docker containers


RHOSP upgrades’ results

▪ System was upgraded twice, successfully

▪ Few sites can do in place upgrades!

▪ Production upgraded in a couple of weeks

▪ Preparation required 1-2 months of work, though!

▪ Swift (IBM Spectrum Scale Object) upgraded within a day

▪ Some unexpected downtime, but limited


Main technical issues we faced during the upgrade

▪ Network problems with OVS

▪ OVS package upgrade might freeze the network, leading to a crash in RabbitMQ

▪ DVR didn’t work for some Kernel releases

▪ We discovered and solved many bugs in the upgrade process

▪ Issues with Ceph scrubbing states

▪ Ansible checks failing on service states when moving to containers

▪ Ceph keys not copied correctly

▪ Volume backups stopped working, DB/Puppet inconsistencies

▪ …

▪ Impossible to live migrate due to different CPU flags after Meltdown/Spectre changes

▪ Queens removed the Keystone v2 API, so we had to reconfigure Swift and Cyberduck
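The live-migration problem mentioned in this list can be illustrated with a simplified model: a running guest can only move to a host that offers every CPU feature the guest was started with. The sketch below is an assumption-laden simplification of the checks a hypervisor stack performs, and the flag sets are illustrative examples rather than values taken from the Pollux nodes.

```python
from typing import Iterable, Set


def missing_flags(source_flags: Iterable[str], dest_flags: Iterable[str]) -> Set[str]:
    """Return CPU flags present on the source host but absent on the destination."""
    return set(source_flags) - set(dest_flags)


def can_live_migrate(source_flags: Iterable[str], dest_flags: Iterable[str]) -> bool:
    """Simplified rule: migration is possible only if no flag would be lost."""
    return not missing_flags(source_flags, dest_flags)


# Illustrative flag sets: a host with Meltdown/Spectre updates applied
# exposes extra flags compared with a host that has not been updated yet.
unpatched_host = {"sse4_2", "avx2"}
patched_host = {"sse4_2", "avx2", "ibpb", "spec_ctrl"}

print(can_live_migrate(unpatched_host, patched_host))       # True
print(can_live_migrate(patched_host, unpatched_host))       # False
print(sorted(missing_flags(patched_host, unpatched_host)))  # ['ibpb', 'spec_ctrl']
```

In this model a guest started on an updated host cannot move to a host that has not been updated yet, which is one way a rolling update of compute nodes can leave VMs unable to migrate.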


TripleO specific issues

▪ Fast moving target

▪ customizations must follow

▪ Template structure is very complex

▪ Operations scaling got better in RHOSP13

▪ Software and configuration updates always apply to all nodes

▪ The system has to be 100% healthy before upgrading

▪ Restarting an upgrade after a failure might require manual work to fix the state

▪ Hard to do team work, workflow is serial

▪ Upgrades are buggy, we solved many issues on our own
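The requirement that the system be 100% healthy before upgrading amounts to a gate that refuses to start while any component reports a non-healthy state. The following is a minimal sketch of such a gate; the service names and state strings are hypothetical, and how the states would be collected is left out.

```python
from typing import Dict, List


def unhealthy_services(states: Dict[str, str]) -> List[str]:
    """Return the sorted names of services whose state is not 'healthy'."""
    return sorted(name for name, state in states.items() if state != "healthy")


def may_start_upgrade(states: Dict[str, str]) -> bool:
    """Allow the upgrade only when every service is healthy."""
    return not unhealthy_services(states)


# Hypothetical snapshot: one component is busy, so the upgrade must wait.
snapshot = {
    "rabbitmq": "healthy",
    "ceph": "scrubbing",
    "openvswitch": "healthy",
}

print(may_start_upgrade(snapshot))    # False
print(unhealthy_services(snapshot))   # ['ceph']
```

Listing the offending services, rather than returning only a yes or no, tells the operator what to fix before retrying.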


Pollux: on-going activities

▪ Cinder/Manila using SAN

▪ GPFS/Huawei/Netapp

▪ Filesystem cross mount across HPC and VM

▪ NFS+KRB

▪ OIDC authentication

▪ New external IdPs for federation

▪ Multidomain / multitenancy

▪ Ceilometer / Gnocchi

▪ Grafana, Kibana, Pentaho


Thank you for your attention.