Migrating from Nagios to Prometheus NOV 07, 2019

Prometheus · 2020-06-24 · 2 Linux (Ubuntu) SDN (Cisco) Chef Terraform Runtastic Infrastructure Base Linux KVM OpenNebula 3600 CPU Cores 20 TB Memory 100 TB Storage Virtualization

Runtastic Infrastructure

● Base
    ○ Linux (Ubuntu)
    ○ SDN (Cisco)
    ○ Chef
    ○ Terraform
● Virtualization
    ○ Linux KVM
    ○ OpenNebula
    ○ 3600 CPU Cores
    ○ 20 TB Memory
    ○ 100 TB Storage
● Physical
    ○ Hybrid
    ○ Big Core DBs
● Technologies
    ○ Really a lot of open source

Our Monitoring back in 2017...

● Nagios
    ○ Many Checks for all Servers
    ○ Checks for NewRelic
● Pingdom
    ○ External HTTP Checks
    ○ Specific Nagios Alerts
    ○ Alerting via SMS
● NewRelic
    ○ Error Rate
    ○ Response Time

Configuration hell….

Alert overflow...

Goals for our new Monitoring system

● Make On Call as comfortable as possible
● Automate as much as possible
● Make use of graphs
● Rework our alerting
● Make it scalable!

Starting with Prometheus...

Prometheus

Our Prometheus Setup

● 2x Bare Metal
● 8 Core CPU
● Ubuntu Linux
● 7.5 TB of Storage
● 7 months of Retention time
● Internal TSDB

Automation

Our Goals for Automation

● Roll out Exporters on new servers automatically
    ○ using Chef
● Use Service Discovery in Prometheus
    ○ using Consul
● Add HTTP Healthcheck for a new Microservice
    ○ using Terraform
● Add Silences with 30d duration
    ○ using Terraform

Consul

● Consul for our Terraform State
● Agent Rollout via Chef
● One Service definition per Exporter on each Server

What Labels do we need?

● What’s the Load of all workers of our Newsfeed service?
    ○ node_load1{service="newsfeed", role="workers"}
● What’s the Load of a specific Leaderboard server?
    ○ node_load1{hostname="prd-leaderboard-server-001"}

...and how we implemented them in Consul

    {
      "service": {
        "name": "prd-sharing-server-001-mongodbexporter",
        "tags": [
          "prometheus",
          "role:trinidad",
          "service:sharing",
          "exporter:mongodb"
        ],
        "port": 9216
      }
    }
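Since one such definition is needed per exporter on each server, generating them is an obvious candidate for automation. The following is a minimal Python sketch of the naming and tagging scheme shown above. The helper `service_definition` is hypothetical; the talk rolls these definitions out with Chef, and this is only an illustration.

```python
import json


def service_definition(hostname, role, service, exporter, port):
    """Build a Consul service definition in the shape shown on the slide.

    Hypothetical helper: the name and tag scheme is copied from the slide,
    the function itself is not part of the original setup.
    """
    return {
        "service": {
            "name": f"{hostname}-{exporter}exporter",
            "tags": [
                "prometheus",  # marks the service as a scrape target
                f"role:{role}",
                f"service:{service}",
                f"exporter:{exporter}",
            ],
            "port": port,
        }
    }


definition = service_definition(
    "prd-sharing-server-001", "trinidad", "sharing", "mongodb", 9216
)
print(json.dumps(definition, indent=2))
```

Encoding `key:value` pairs in plain tags keeps the definition simple and lets Prometheus turn them into labels with relabeling.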

Scrape Configuration

    - job_name: prd
      consul_sd_configs:
        - server: 'prd-consul:8500'
          token: 'ourconsultoken'
          datacenter: 'lnz'
      relabel_configs:
        - source_labels: [__meta_consul_tags]
          regex: .*,prometheus,.*
          action: keep
        - source_labels: [__meta_consul_node]
          target_label: hostname
        - source_labels: [__meta_consul_tags]
          regex: .*,service:([^,]+),.*
          replacement: '${1}'
          target_label: service
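The regexes rely on Prometheus exposing the Consul tags as one string, joined by commas with a leading and trailing comma, which is why each pattern is wrapped in `,...,`. A rough Python simulation of the three rules above, as a sketch only: the function names are made up for illustration, and `re.fullmatch` stands in for the fact that Prometheus anchors relabel regexes. A `role` label like the one queried earlier could presumably be extracted with an analogous rule; that rule is not on the slide.

```python
import re

TAG_SEPARATOR = ","


def consul_tags_label(tags):
    """Join tags the way they appear in __meta_consul_tags:
    separator-joined, with a leading and a trailing separator."""
    return TAG_SEPARATOR + TAG_SEPARATOR.join(tags) + TAG_SEPARATOR


def relabel(tags, node):
    """Apply the three relabel rules from the slide to one discovered target.

    Returns the resulting labels, or None if the target is dropped.
    """
    meta_tags = consul_tags_label(tags)

    # action: keep -> drop every target whose tags lack "prometheus".
    if re.fullmatch(r".*,prometheus,.*", meta_tags) is None:
        return None

    # __meta_consul_node is copied to the hostname label.
    labels = {"hostname": node}

    # Default action "replace": the label is only written on a match.
    match = re.fullmatch(r".*,service:([^,]+),.*", meta_tags)
    if match:
        labels["service"] = match.group(1)
    return labels


tags = ["prometheus", "role:trinidad", "service:sharing", "exporter:mongodb"]
print(relabel(tags, "prd-sharing-server-001"))
print(relabel(["role:trinidad"], "prd-sharing-server-001"))
```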

External Health Checks

● 3x Blackbox Exporters
● Accessing SSL Endpoints
● Checks for
    ○ HTTP Response Code
    ○ SSL Certificate
    ○ Duration

Add Healthcheck via Terraform

    resource "consul_service" "health_check" {
      name = "${var.srv_name}-healthcheck"
      node = "blackbox_aws"

      tags = [
        "healthcheck",
        "url:https://status.runtastic.com/${var.srv_name}",
        "service:${var.srv_name}",
      ]
    }

Job Config for Blackbox Exporters

    - job_name: blackbox_aws
      metrics_path: /probe
      params:
        module: [http_health_monitor]
      consul_sd_configs:
        - server: 'prd-consul:8500'
          token: 'ourconsultoken'
          datacenter: 'lnz'
      relabel_configs:
        - source_labels: [__meta_consul_tags]
          regex: .*,healthcheck,.*
          action: keep
        - source_labels: [__meta_consul_tags]
          regex: .*,url:([^,]+),.*
          replacement: '${1}'
          target_label: __param_target
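Writing to `__param_target` sets the `target` query parameter of the scrape request, so the URL stored in the Consul tag ends up as the endpoint the Blackbox Exporter probes. A small Python sketch of the resulting request URL follows. The exporter address is a hypothetical placeholder; how the scrape address is pointed at the exporter is not shown on the slide.

```python
import re
from urllib.parse import urlencode

# Hypothetical address of one Blackbox Exporter; not taken from the slides.
EXPORTER_ADDRESS = "blackbox.example.internal:9115"


def probe_url(meta_consul_tags, module="http_health_monitor"):
    """Return the URL scraped for one health check, or None if dropped."""
    # action: keep -> only services tagged "healthcheck" survive.
    if re.fullmatch(r".*,healthcheck,.*", meta_consul_tags) is None:
        return None
    params = {"module": module}
    # The url:<...> tag becomes the "target" query parameter.
    match = re.fullmatch(r".*,url:([^,]+),.*", meta_consul_tags)
    if match:
        params["target"] = match.group(1)
    return f"http://{EXPORTER_ADDRESS}/probe?{urlencode(params)}"


tags = ",healthcheck,url:https://status.runtastic.com/newsfeed,service:newsfeed,"
print(probe_url(tags))
```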

Add Silence via Terraform

    resource "null_resource" "prometheus_silence" {
      provisioner "local-exec" {
        command = <<EOF
    ${var.amtool_path} silence add 'service=~SERVICENAME' \
      --duration='30d' \
      --comment='Silence for the newly deployed service' \
      --alertmanager.url='http://prd-alertmanager:9093'
    EOF
      }
    }
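The provisioner simply shells out to `amtool`. The same invocation can be assembled as an argument list, which avoids shell quoting issues; here is a minimal Python sketch. The function name and its defaults are illustrative, and `newsfeed` is only used as an example service name.

```python
import shlex


def silence_command(service, amtool_path="amtool",
                    alertmanager_url="http://prd-alertmanager:9093"):
    """Build the amtool invocation from the slide for one service."""
    return [
        amtool_path, "silence", "add", f"service=~{service}",
        "--duration=30d",
        "--comment=Silence for the newly deployed service",
        f"--alertmanager.url={alertmanager_url}",
    ]


argv = silence_command("newsfeed")
# Printed for inspection only; nothing is executed here.
print(shlex.join(argv))
```

To actually create the silence, the list could be passed to `subprocess.run(argv, check=True)` on a host where `amtool` is installed.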

OpsGenie

Our Initial Alerting Plan

● Alerts with Low Priority
    ○ Slack Integration
● Alerts with High Priority (OnCall)
    ○ Slack Integration
    ○ OpsGenie

...why not forward all Alerts to OpsGenie?

Define OpsGenie Alert Routing

● Prometheus OnCall Integration
    ○ High Priority Alerts (e.g. Service DOWN)
    ○ Call the poor On Call Person
    ○ Post Alerts to Slack #topic-alerts
● Prometheus Ops Integration
    ○ Low Priority Alerts (e.g. Chef-Client failed runs)
    ○ Disable Notifications
    ○ Post Alerts to Slack #prometheus-alerts

Setup Alertmanager Config

    - receiver: 'opsgenie_oncall'
      group_wait: 10s
      group_by: ['...']
      match:
        oncall: 'true'

    - receiver: 'opsgenie'
      group_by: ['...']
      group_wait: 10s
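Alertmanager walks sibling routes in order and, unless `continue` is set, stops at the first one whose matchers all match. The second route has no matchers, so it catches everything that is not an on-call alert. A simplified Python sketch of that decision, as an illustration and not Alertmanager's actual implementation:

```python
ROUTES = [
    {"receiver": "opsgenie_oncall", "match": {"oncall": "true"}},
    {"receiver": "opsgenie", "match": {}},  # no matchers: catches the rest
]


def pick_receiver(labels, routes=ROUTES):
    """Return the receiver of the first route whose matchers all match."""
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return None


print(pick_receiver({"alertname": "HTTPProbeFailedMajor", "oncall": "true"}))
print(pick_receiver({"alertname": "MongoDB-ScannedObjects", "priority": "P3"}))
```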

...and its receivers

    - name: "opsgenie_oncall"
      opsgenie_configs:
        - api_url: "https://api.eu.opsgenie.com/"
          api_key: "ourapitoken"
          priority: "{{ range .Alerts }}{{ .Labels.priority }}{{ end }}"
          message: "{{ range .Alerts }}{{ .Annotations.title }}{{ end }}"
          description: "{{ range .Alerts }}\n{{ .Annotations.summary }}\n\n{{ if ne .Annotations.dashboard \"\" -}}\nDashboard:\n{{ .Annotations.dashboard }}\n{{- end }}{{ end }}"
          tags: "{{ range .Alerts }}{{ .Annotations.instance }}{{ end }}"

Why we use group_by: ['...']

● OpsGenie already deduplicates Alerts
● Otherwise Alerts are grouped into one notification
● Grouped Alerts are easy to overlook
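In Alertmanager, the special value `'...'` groups by all labels, which effectively disables aggregation: every distinct alert gets its own notification. The difference can be sketched in Python as follows; this is a toy model of the grouping key, not Alertmanager code, and the sample alerts are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts into notifications the way a group_by setting would.

    group_by=["..."] means "group by every label", so only alerts with
    identical label sets share a notification.
    """
    groups = defaultdict(list)
    for labels in alerts:
        if group_by == ["..."]:
            key = tuple(sorted(labels.items()))
        else:
            key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return list(groups.values())


alerts = [
    {"alertname": "HTTPProbeFailedMajor", "service": "newsfeed"},
    {"alertname": "HTTPProbeFailedMajor", "service": "leaderboard"},
]
print(len(group_alerts(alerts, ["alertname"])))  # both share one notification
print(len(group_alerts(alerts, ["..."])))        # one notification per alert
```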

Example Alerting Rule for On Call

    - alert: HTTPProbeFailedMajor
      expr: max by(instance,service)(probe_success) < 1
      for: 1m
      labels:
        oncall: "true"
        priority: "P1"
      annotations:
        title: "{{ $labels.service }} DOWN"
        summary: "HTTP Probe for {{ $labels.service }} FAILED.\nHealth Check URL: {{ $labels.instance }}"
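Because the expression takes the maximum of `probe_success` per `instance` and `service`, a single failing probe is not enough to page someone, provided several exporters report series with the same `instance` and `service`. The condition only holds once every one of them reports 0. A small Python sketch of that aggregation, with a hypothetical `prober` label standing in for whatever distinguishes the three Blackbox Exporters:

```python
def max_by(samples, by):
    """Aggregate (labels, value) samples like PromQL's `max by(...)`."""
    result = {}
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in by)
        result[key] = max(value, result.get(key, value))
    return result


def firing(samples, by=("instance", "service")):
    """Return the groups for which `max by(...)(probe_success) < 1` holds."""
    return [key for key, value in max_by(samples, by).items() if value < 1]


URL = "https://status.runtastic.com/newsfeed"


def probes(values):
    """Build one probe_success sample per (hypothetical) prober."""
    return [
        ({"instance": URL, "service": "newsfeed", "prober": str(i)}, v)
        for i, v in enumerate(values)
    ]


print(firing(probes([1, 0, 1])))  # one prober fails: condition not met
print(firing(probes([0, 0, 0])))  # all probers fail: condition met
```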

Example Alerting Rule with Low Priority

    - alert: MongoDB-ScannedObjects
      expr: max by(hostname, service)(rate(mongodb_mongod_metrics_query_executor_total[30m])) > 500000
      for: 1m
      labels:
        priority: "P3"
      annotations:
        title: "MongoDB - Scanned Objects detected on {{ $labels.service }}"
        summary: "High value of scanned objects on {{ $labels.hostname }} for service {{ $labels.service }}"
        dashboard: "https://prd-prometheus.runtastic.com/d/oCziI1Wmk/mongodb"

Alert Management via Slack

Setting up the Heartbeat

    groups:
    - name: opsgenie.rules
      rules:
      - alert: OpsGenieHeartBeat
        expr: vector(1)
        for: 5m
        labels:
          heartbeat: "true"
        annotations:
          summary: "Heartbeat for OpsGenie"

...and its Alertmanager Configuration

    - receiver: 'opsgenie_heartbeat'
      repeat_interval: 5m
      group_wait: 10s
      match:
        heartbeat: 'true'

    - name: "opsgenie_heartbeat"
      webhook_configs:
        - url: 'https://api.eu.opsgenie.com/v2/heartbeats/prd_prometheus/ping'
          send_resolved: false
          http_config:
            basic_auth:
              password: "opsgenieAPIkey"
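The heartbeat works as a dead man's switch: `vector(1)` always returns a sample, so the alert fires permanently and Alertmanager pings OpsGenie every `repeat_interval`. If Prometheus or Alertmanager stops working, the pings stop and the receiving side can raise an alarm. The receiving logic can be sketched in Python as below. The class is illustrative and the 10-minute interval is an assumption; the OpsGenie-side heartbeat settings are not shown in the slides.

```python
from datetime import datetime, timedelta


class Heartbeat:
    """Minimal dead man's switch: expired when no ping arrived in time."""

    def __init__(self, interval):
        self.interval = interval
        self.last_ping = None

    def ping(self, now):
        """Record that a heartbeat arrived at time `now`."""
        self.last_ping = now

    def expired(self, now):
        """True if no ping was ever seen or the last one is too old."""
        return self.last_ping is None or now - self.last_ping > self.interval


heartbeat = Heartbeat(interval=timedelta(minutes=10))  # assumed interval
start = datetime(2019, 11, 7, 12, 0)
heartbeat.ping(start)
print(heartbeat.expired(start + timedelta(minutes=5)))   # ping still fresh
print(heartbeat.expired(start + timedelta(minutes=11)))  # ping overdue
```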

CI/CD Pipeline

Goals for our Pipeline

● Put all Alerting and Recording Rules into a Git Repository
● Automatically test for syntax errors
● Deploy master branch on all Prometheus servers
● Merge to master -> Deploy on Prometheus

How it works

● Jenkins
    ○ running promtool against each .yml file
● Bitbucket sending HTTP calls when master branch changes
● Ruby based HTTP Handler on Prometheus Servers
    ○ Accepting HTTP calls from Bitbucket
    ○ Git pull
    ○ Prometheus reload
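The two halves of the pipeline, validation in Jenkins and deployment on the Prometheus servers, can be sketched as two small functions. This is a Python illustration under stated assumptions, not the Ruby handler from the talk: it assumes the `.yml` files are rule files checked with `promtool check rules`, and that Prometheus is reloaded by sending SIGHUP via `pkill`. The slides specify neither the exact promtool subcommand nor how the reload is triggered.

```python
import subprocess
from pathlib import Path


def check_rules(rules_dir, run=subprocess.run):
    """Run `promtool check rules` on every .yml file; return failing files."""
    failed = []
    for path in sorted(Path(rules_dir).glob("*.yml")):
        result = run(["promtool", "check", "rules", str(path)])
        if result.returncode != 0:
            failed.append(path.name)
    return failed


def deploy(repo_dir, run=subprocess.run):
    """What a handler could do after a push to master: pull, then reload."""
    run(["git", "-C", str(repo_dir), "pull", "--ff-only"], check=True)
    # Assumption: reload via SIGHUP, which makes Prometheus re-read its
    # configuration and rule files. The actual mechanism is not on the slide.
    run(["pkill", "-HUP", "-x", "prometheus"], check=True)
```

The `run` parameter exists only so the functions can be exercised without `promtool`, `git`, or a running Prometheus; a CI job would call `check_rules` and fail the build when the returned list is not empty.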

Verify Builds for each Branch

runtastic.com

THANK YOU

Niko Dominkowitsch
Infrastructure Engineer

[email protected]