67
Docker, Monitoring and SLURM Specific Visualisations QNIBTerminal @ work

Docker, Monitoring and SLURM Specific Visualisationstorbenr/presentations/NeIC_presentation... · •Docker in a Nutshell • QNIBx Terminal Monitoring Inventory • SLURM Autogenerated

Embed Size (px)

Citation preview

ssDocker, Monitoring and SLURM Specific Visualisations

QNIBTerminal @ work

• Docker in a Nutshell • QNIBx

Terminal

Monitoring

Inventory

• SLURM Autogenerated Dashboards

2

Agenda

3

About Me

• Christian Kniep @CQnib, [email protected]

4

About Me

• Christian Kniep @CQnib, [email protected]

• >10y Iteration SysAdmin, SysOps, SysEngineer, R&D Engineer

DevOps @Locafox (hyper-scale web-service)

5

About Me

• Christian Kniep @CQnib, [email protected]

• >10y Iteration SysAdmin, SysOps, SysEngineer, R&D Engineer

DevOps @Locafox (hyper-scale web-service)

• Founder of QNIB Solutions Holistic System Management

Containerization of SysOps and Workload

Consultancy / Software Design & Development

Docker in a Nutshell

7

Multiple Guests

SERVER SERVER

Traditional Virtualisation Containerisation

8

Multiple Guests

SERVER

HOST  KERNEL

SERVER

HOST  KERNEL

Traditional Virtualisation Containerisation

9

Multiple Guests

SERVER

HOST  KERNEL

Userland

SERVER

HOST  KERNEL

Userland

Traditional Virtualisation Containerisation

10

Multiple Guests

SERVER

HOST  KERNEL

HYPERVISOR  (Type  II)

Userland

SERVER

HOST  KERNEL

Userland

Traditional Virtualisation Containerisation

11

Multiple Guests

SERVER

HOST  KERNEL

HYPERVISOR  (Type  II)

KERNEL

Userland

KERNEL KERNEL

SERVER

HOST  KERNEL

Userland

Traditional Virtualisation Containerisation

12

Multiple Guests

SERVER

HOST  KERNEL

HYPERVISOR  (Type  II)

KERNEL

Userland

KERNEL KERNEL

UserlandUserland Userland

SERVER

HOST  KERNEL

Userland

Traditional Virtualisation Containerisation

13

Multiple Guests

SERVER

HOST  KERNEL

HYPERVISOR  (Type  II)

KERNEL

SERVICE

Userland

KERNEL KERNEL

UserlandUserland Userland

SERVICE SERVICE

SERVER

HOST  KERNEL

Userland

Traditional Virtualisation Containerisation

14

Multiple Guests

SERVER

HOST  KERNEL

HYPERVISOR  (Type  II)

KERNEL

SERVICE

Userland

KERNEL KERNEL

UserlandUserland Userland

SERVICE SERVICE

SERVER

HOST  KERNEL

Userland

Traditional Virtualisation Containerisation

Docker

15

Multiple Guests

SERVER

HOST  KERNEL

HYPERVISOR  (Type  II)

KERNEL

SERVICE

Userland

KERNEL KERNEL

UserlandUserland Userland

SERVICE SERVICE

SERVER

HOST  KERNEL

Userland

Userland  (#1) Userland  (#2)

Traditional Virtualisation Containerisation

Docker

16

Multiple Guests

SERVER

HOST  KERNEL

HYPERVISOR  (Type  II)

KERNEL

SERVICE

Userland

KERNEL KERNEL

UserlandUserland Userland

SERVICE SERVICE

SERVER

HOST  KERNEL

Userland

Userland  (#1) Userland  (#2)

SERVICE SERVICE

Traditional Virtualisation Containerisation

Docker

HOSTcontainer1

17

Docker Internal View

• Containers are ‘grouped processes’ isolated by Kernel Namespaces (PID, network, mount, …)

resource restrictions applicable through CGroups

bash

ls -l

container2

apache

container3

mysqld

container4

slurmd

ssh

• 1/2 Day, July 16th @ISC High Performance Deep dive into the talking points

How Docker might impact System Operations & HPC Applications

Further discussion beyond what I am talking about today

18

Docker Workshop

• Full Day, September 28th @ISC Cloud&BigData

19

Docker Workshop #2

QNIBTerminal

• Framework of system container to spin up stacks SLURM

21

QNIBTerminal

• Framework of system container to spin up stacks SLURM

22

QNIBTerminal

• Framework of system container to spin up stacks SLURM

23

QNIBTerminal

• Framework of system container to spin up stacks SLURM

24

QNIBTerminal

1

2

3

QNIBMonitoring

• Current monitoring systems do not connect overlaying metrics with log events

use/build inventory system to provide connections usually hidden

users perspective and scope/context/background

26

QNIBMonitoring

• Current monitoring systems do not connect overlaying metrics with log events

use/build inventory system to provide connections usually hidden

users perspective and scope/context/background

• QNIBMonitoring provides open metrics system (system / application metrics, log aggregates)

log event framework, consuming/processing/visualise events

auto discovery / configuration through consul

27

QNIBMonitoring

28

QNIBMonitoring

• Logstash (Log/Event Monitoring)

29

QNIBMonitoring

• Grafana (Performance Monitoring)

30

QNIBMonitoring

• Overlay Metrics w/ Events

QNIBInventory

32

QNIBInventory

• Network Topology

33

QNIBInventory

• Installed Software

34

QNIBInventory

• SLURM Cluster

• Enrich Log/Events

35

QNIBInventory

1

2

• Enrich Log/Events • Help visualise connections

36

QNIBInventory

• Enrich Log/Events • Help visualise connections • Build up history

37

QNIBInventory

Cluster Use-Case

• Multiple backgrounds have to be considered Enduser (Engineer, Software Developer, Scientist)

Operation Personel

Management

• Psychology plays important role Local rationality / context

10.000ft Overview vs. verifying hypothesis vs. Reporting

Empower users to extend their domain knowledge by providing toolset

39

Context Sensitive Dashboards

• Small SLURM cluster couple of nodes, two user groups, couple of users

script & MPI workload

40

Cluster Usecase

• Small SLURM cluster couple of nodes, two user groups, couple of users

script & MPI workload

41

Cluster Usecase

srv backend consul

slurmctldslurmctld

compute0slurmd

compute<N>slurmd

Compute

• Small SLURM cluster couple of nodes, two user groups, couple of users

script & MPI workload

42

Cluster Usecase

srv backend consul

slurmctldslurmctld

compute0slurmd

compute<N>slurmd

Compute

• Small SLURM cluster couple of nodes, two user groups, couple of users

script & MPI workload

43

Cluster Usecase

srv backend consul

slurmctldslurmctld

compute0slurmd

compute<N>slurmd

Compute

carboncarbon

graphite-apigraphite-api

Performance

grafanagrafana

elasticsearch

• Small SLURM cluster couple of nodes, two user groups, couple of users

script & MPI workload

44

Cluster Usecase

srv backend consul

slurmctldslurmctld

compute0slurmd

compute<N>slurmd

Compute

carboncarbon

graphite-apigraphite-api

Performance

grafanagrafana

Log/Events

elasticsearch

logger logstash

kibana kiabana

kopf es-kopf

elasticsearch

• Small SLURM cluster couple of nodes, two user groups, couple of users

script & MPI workload

45

Cluster Usecase

srv backend consul

slurmctldslurmctld

compute0slurmd

compute<N>slurmd

Compute

carboncarbon

graphite-apigraphite-api

Performance

grafanagrafana

Log/Events

elasticsearch

logger logstash

kibana kiabana

kopf es-kopf

neo4j neo4j

Inventory

inventory QINBInv

elasticsearch

• Small SLURM cluster couple of nodes, two user groups, couple of users

script & MPI workload

46

Cluster Usecase

srv backend consul

slurmctldslurmctld

compute0slurmd

compute<N>slurmd

Compute

carboncarbon

graphite-apigraphite-api

Performance

grafanagrafana

Log/Events

elasticsearch

logger logstash

kibana kiabana

kopf es-kopf

neo4j neo4j

Inventory

inventory QINBInv

postgres postgres

galaxy

galaxy galaxy

• Live cluster Status Utilisation per cluster / user / user-group

SLA met by SysOps

Most common jobs, misbehaving enduser

47

Management Context

• Live cluster Status Utilisation per cluster / user / user-group

SLA met by SysOps

Most common jobs, misbehaving enduser

• Reports per day / user / job-type / …

48

Management Context

• Live cluster Status Utilisation per cluster / user / user-group

SLA met by SysOps

Most common jobs, misbehaving enduser

• Reports per day / user / job-type / …

• Capacity Planning utilisation over time, comparison of HW generations, global FS capacity

49

Management Context

50

SLURM Dashboard

51

SLURM Dashboard

• Nodes are connected to Partitions

52

SLURM Inventar

• Nodes are connected to Partitions • Jobs are connected to both

53

SLURM Inventar

54

SLURM Dashboard

• Live progress of SLURM job Monitor iteration speed to estimate workload behaviour

Get to know job while it’s running (instead of postmortem)

Introduce application profiling / log events (enhance feedback)

55

Enduser Context

• Live progress of SLURM job Monitor iteration speed to estimate workload behaviour

Get to know job while it’s running (instead of postmortem)

Introduce application profiling / log events (enhance feedback)

• Post Mortem Get detailed report after job has finished

56

Enduser Context

• Live progress of SLURM job Monitor iteration speed to estimate workload behaviour

Get to know job while it’s running (instead of postmortem)

Introduce application profiling / log events (enhance feedback)

• Post Mortem Get detailed report after job has finished

• MDO jobs depending on outcome and progression submit next iteration(s)

57

Enduser Context

58

SLURM Dashboard

59

SLURM Dashboard

• Live cluster Status USE method overviews (Utilisation/Saturation/Errors)

Anomaly detection (w/ and w/o humans)

Spotting abnormal behaviour

60

SysOps Context

• Live cluster Status USE method overviews (Utilisation/Saturation/Errors)

Anomaly detection (w/ and w/o humans)

Spotting abnormal behaviour

• Drill into monitoring verify hypothesis about incidents/problems

correlate events, metrics and inventory

61

SysOps Context

• Live cluster Status USE method overviews (Utilisation/Saturation/Errors)

Anomaly detection (w/ and w/o humans)

Spotting abnormal behaviour

• Drill into monitoring verify hypothesis about incidents/problems

correlate events, metrics and inventory

• Guid through ‘known problems’ close feedback loops provide confidence

62

SysOps Context

63

Central Logging

64

Galaxy

65

Galaxy Use-Cases

SLURM

66

Galaxy Use-Cases

SLURM Log Events

WORKFLOW

Metrics Inventory

• Model Assess Workflow in Galaxy Easy to grasp (in contrast to Hadoop, Spark, …)

Event triggered, Cronjob?

Using idle compute resources

67

Thank you!

• Contact [email protected]

@CQnib, @_qnib

• Web www.qnib.org (blog)

doc.qnib.org (Paper)

• Feel free… …ask questions (now / later)

…ask for a Demo