Exploring History with Hawk - SUSECON · 2020. 7. 2. · Corosync Messaging / Infrastructure...

Exploring History with Hawk

An Introduction to Cluster Forensics

Kristoffer Grönlund

High Availability Software Developer

kgronlund@suse.com

This tutorial

• High Availability in 5 minutes

• Introduction to HAWK

‒ What's new in HAWK 2

• History Explorer

‒ Cluster Forensics

‒ Example Usage

• Summary

About me

• Kristoffer Grönlund

‒ Developer

‒ crmsh

‒ hawk

‒ resource-agents

‒ Maintainer

‒ fence-agents

‒ haproxy

High Availability

What is a cluster?

• Cluster → 1 - 32* Nodes

• Node → Single machine in cluster

‒ Hardware or virtualized

‒ Remote nodes

• Site → Physical location

‒ Local

‒ Metro

‒ Geographical

* Scale beyond 32 nodes with remote nodes

Resources

• Agent Classes

‒ Open Cluster Framework (OCF) Agents

‒ resource-agents

‒ systemd services

‒ Fencing agents

‒ Init scripts

• Examples:

‒ Web Server, File Server

‒ Databases

‒ Filesystems, IP Addresses

‒ VMs, resources in VMs...

Constraints

• Order

‒ Start resource A before resource B

• Location

‒ Resource A prefers node

• Colocation

‒ Resource A with resource B

• Score

‒ Mandatory vs. Preference

‒ Numeric value or +/- infinity

‒ Resource stickiness
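The score rules can be sketched in a few lines of Python. This is a minimal illustration, not Pacemaker's implementation. It assumes Pacemaker's documented convention that infinity is represented as 1,000,000, and that minus infinity wins whenever it is combined with any other score. The helper name `add_scores` is hypothetical.

```python
INFINITY = 1_000_000  # assumed: Pacemaker treats scores at or beyond this as infinite


def add_scores(a: int, b: int) -> int:
    """Combine two scores using infinity arithmetic (illustrative helper)."""
    # -INFINITY always wins, including over +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite scores add normally but are clamped to the infinite range.
    return max(-INFINITY, min(INFINITY, a + b))


# Two finite preferences simply add up.
print(add_scores(100, 50))
# A mandatory ban (-INFINITY) overrides any preference.
print(add_scores(100, -INFINITY))
```

A finite score expresses a preference that other scores can outweigh, while a score of plus or minus infinity is mandatory.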

Overview

[Figure: cluster architecture diagram. Labels: Messaging / Infrastructure (Corosync), Resource Allocation, Resource Agents. Components shown per node: Resources, Local Resource Manager, Cluster Resource Manager, Corosync. The Designated Coordinator (DC) node also shows the Policy Engine and the Cluster Information Base (CIB); the other node shows a CIB Replica.]

Fencing

• Dealing with Schrödinger's cat

• Goal: Preventing corruption

• Storage based: SBD

‒ Recommended if possible

‒ No special hardware required

• Hardware based: IPMI, iLO, …

‒ Many supported devices

• crmsh

‒ Command line interface

• HAWK

‒ Web interface

Learn more

• www.suse.com/documentation/sle-ha-12/

• Two node cluster in two commands

node1 # ha-cluster-init

node2 # ha-cluster-join -c node1

Introducing HAWK

HAWK - Overview

• “High Availability Web Konsole”

• Monitoring

• Configuration / Administration

• Dashboard

HAWK - Technical details

• Installed by ha-cluster-bootstrap

• Runs on the cluster nodes

• Ruby on Rails

• https://<node>:7630/

HAWK - Security

• Default user is hacluster

‒ Remember to change the password

• HTTPS for secure access

• Replace SSL certificate with your own

‒ /etc/hawk/hawk.key

‒ /etc/hawk/hawk.pem

HAWK 0.7

Status

Dashboard

HAWK 2

A New Look

• Complete visual overhaul

‒ More intuitive

‒ Similar to other SUSE tools

• Improved features

‒ History Explorer

‒ More powerful wizards

‒ Integrated help

• Supports new cluster features

Upgrading to HAWK 2

zypper install hawk2

Status

Dashboard

Simulator

Simulator, node event

Simulator, results

Creating resources

Command log

Wizards

• Apply a complete cluster configuration

• Helps configuring constraints and groups

• Install and configure required software

Wizards

Wizard, configuration

Wizard, verify changes

Wizard, advanced options

Wizard, optional steps

Wizard, verify changes (1)

Wizard, verify changes (2)

Command line wizards

crm script

show virtual-ip

verify virtual-ip id=admin-ip ip=10.13.37.42

run virtual-ip id=...

History Explorer

Cluster Forensics

• Something went wrong

‒ How can we figure it out?

‒ Pitfalls

• Understanding the cluster logs

‒ Use the history explorer

‒ Get a cluster report

Root Cause Analysis

• Start at the evidence

• Trace backwards

• Know the application

• Assume you know nothing

Jumping To Conclusions

• Always stay on the evidence

• When the evidence runs out, we are guessing

• Guessing is OK!

‒ But know when you are guessing

The Evidence

• Failed Cluster Action

‒ Software bugs, crashes

‒ Configuration error

• Failed Node

‒ Hardware failure

‒ Communication error

Collecting data

crm report -f '2015-10-10 12:00' -t '2015-10-10 14:00' strange_event

Understanding the logs

2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header at http://localhost:8000
2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr#011(Started sle12sp1b)
2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530055+02:00 sle12sp1a crmd[1590]: notice: Processing graph 156 (ref=pe_calc-dc-1444585224-290) derived from /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b
2015-10-11T19:40:24.740118+02:00 sle12sp1a crmd[1590]: notice: Initiating action 6: start admin_addr_start_0 on sle12sp1b
2015-10-11T19:40:24.801183+02:00 sle12sp1a crmd[1590]: notice: Initiating action 1: monitor admin_addr_monitor_10000 on sle12sp1b
2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-55.bz2): Complete
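Each entry in this log has the same shape: a timestamp, the node that wrote it, the daemon or resource agent with its process ID, and the message. The sketch below splits one entry into those fields, which helps answer "which node wrote this?" and "which component said it?". The regular expression is an assumption based only on the sample shown here, and the names `LogEntry` and `parse_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample log; other syslog formats may differ.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"   # ISO 8601 timestamp
    r"(?P<host>\S+)\s+"         # node that wrote the entry
    r"(?P<source>[^\[\s]+)"     # daemon or resource agent, e.g. crmd or apache(srv2)
    r"\[(?P<pid>\d+)\]:\s+"     # process ID
    r"(?P<message>.*)$"         # remainder of the line
)


class LogEntry(NamedTuple):
    timestamp: str
    host: str
    source: str
    pid: int
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line, returning None if it does not match the pattern."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["timestamp"], m["host"], m["source"], int(m["pid"]), m["message"])


sample = (
    "2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: "
    "notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b"
)
entry = parse_line(sample)
print(entry.source, "on", entry.host, "->", entry.message)
```

Note that the node writing the entry (`sle12sp1a`, the DC in this sample) is not necessarily the node where the action runs (`sle12sp1b`).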

Internal components

• Cluster Information Base (CIB)

• Cluster Resource Management daemon (crmd)

• Local Resource Management daemon (lrmd)

• Policy Engine (pengine)

• Fencing daemon (stonithd)

Policy Engine

• Designated Controller (DC)

‒ Elected automatically

‒ Calculates ideal cluster state

‒ Decides on actions to achieve state

Transition

• Sequence of actions to reach new state

• Records state before and after transition

• Saved to /var/lib/pacemaker/pengine/

• Numbered with sequence number

‒ Number sequence may reset to 0 if DC is re-elected
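To find the transitions in a set of log messages, one option is to look for the policy engine's "Calculated Transition" message, which names both the transition number and the saved input file. The sketch below assumes the message wording seen in the sample log from this talk. The function name `find_transitions` is hypothetical.

```python
import re

# Wording taken from the pengine message in the sample log.
CALCULATED = re.compile(r"Calculated Transition (?P<number>\d+): (?P<path>\S+)")


def find_transitions(messages):
    """Return (transition number, policy engine input file) pairs in log order."""
    found = []
    for message in messages:
        m = CALCULATED.search(message)
        if m:
            found.append((int(m["number"]), m["path"]))
    return found


messages = [
    "notice: Restart admin_addr#011(Started sle12sp1b)",
    "notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2",
    "notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b",
]
for number, path in find_transitions(messages):
    print(f"transition {number} was calculated from {path}")
```

Because the sequence may reset when the DC is re-elected, it is safer to identify a transition by its number together with its timestamp or input file, not by the number alone.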

Cluster Actions

• <resource>_<action>_<nn>

• Actions

‒ start

‒ stop

‒ promote

‒ demote

‒ monitor

‒ migrate_to

‒ migrate_from
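Action names such as `admin_addr_monitor_10000` follow the `<resource>_<action>_<nn>` pattern, but resource names and some action names contain underscores themselves, so a plain split on `_` is not enough. The sketch below peels off the trailing number first and then matches the known actions from the list above. In the sample log the number is 0 for start and stop and 10000 for the recurring monitor, which is consistent with it being the operation interval in milliseconds; treat that reading as an assumption. The function name `parse_action_key` is hypothetical.

```python
# Action names from the list above.
ACTIONS = ("start", "stop", "promote", "demote", "monitor", "migrate_to", "migrate_from")


def parse_action_key(key: str):
    """Split '<resource>_<action>_<nn>' into (resource, action, nn)."""
    head, sep, number = key.rpartition("_")
    if not sep or not number.isdigit():
        raise ValueError(f"not an action key: {key!r}")
    # Check longer action names first so the match is unambiguous.
    for action in sorted(ACTIONS, key=len, reverse=True):
        suffix = "_" + action
        if head.endswith(suffix) and len(head) > len(suffix):
            return head[: -len(suffix)], action, int(number)
    raise ValueError(f"unknown action in key: {key!r}")


print(parse_action_key("admin_addr_monitor_10000"))
print(parse_action_key("admin_addr_stop_0"))
```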

Cluster Actions

• Error Codes

0: Success

1: Generic Error

2: Argument Error

3: Unimplemented Action

4: Insufficient Permissions

5: Required Component Is Missing

6: Configuration Error

7: Resource Was Not Running

8: Running As Primary

9: Failed As Primary
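When reading many failed actions, a small lookup table saves repeated trips to the documentation. The sketch below encodes exactly the codes listed above. The helper names `describe` and `is_unexpected` are hypothetical, and the default expected value of 0 is an assumption that fits actions such as start or a monitor on a running resource.

```python
# Return codes as listed above, keyed by numeric value.
RETURN_CODES = {
    0: "Success",
    1: "Generic Error",
    2: "Argument Error",
    3: "Unimplemented Action",
    4: "Insufficient Permissions",
    5: "Required Component Is Missing",
    6: "Configuration Error",
    7: "Resource Was Not Running",
    8: "Running As Primary",
    9: "Failed As Primary",
}


def describe(rc: int) -> str:
    """Return the description for a return code, or a marker for unknown codes."""
    return RETURN_CODES.get(rc, f"Unknown return code ({rc})")


def is_unexpected(rc: int, expected: int = 0) -> bool:
    """True when an action's result differs from what the cluster expected."""
    return rc != expected


# A monitor on a resource that should be running expects 0; 7 means it has stopped.
print(describe(7), is_unexpected(7))
```

Whether a code counts as a failure depends on what was expected: for a check on a resource that should be stopped, 7 is the expected answer, which `is_unexpected(7, expected=7)` reflects.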

Cluster Action Failure

• Unexpected result when performing action

• Triggers transition

• May also trigger fencing (stop failure)

Node Failure

• Quorum = Majority vote

‒ Improves availability

‒ Avoids fence loops

‒ Downside: Need more nodes

• Smaller partitions are fenced
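The majority rule is simple arithmetic. The sketch below assumes one vote per node and a strict majority; it ignores any special handling a real cluster may configure. The function name `has_quorum` is hypothetical.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """A partition is quorate when it holds more than half of all votes."""
    return 2 * partition_votes > total_votes


# Three-node cluster split 2 / 1: only the larger side keeps quorum.
print(has_quorum(2, 3), has_quorum(1, 3))
# Two-node cluster split 1 / 1: neither side has a majority.
print(has_quorum(1, 2))
```

The even split shows why majority voting needs more nodes: with two nodes, losing either one leaves no partition with more than half the votes.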

Node Failure

• Crash / reboot

• Network issues

• Leads to chaos without fencing

‒ Cluster no longer knows if node is running resources

• Uncommunicative nodes are fenced

‒ Enforces a known state

History Explorer

• Command line:

‒ crm history

• Collect logs from cluster nodes

• Analyse transitions

• Present summary of events

• View configuration

• Transition graph

• Transition diff

• Extract logs during a particular transition

History Explorer

Example configuration

[Figure: example cluster. Labels: demo-node1, demo-node2, g-proxy, proxy, proxy-vip, ping]

Example Description

• Two web servers

‒ Port 8000

• HAProxy

‒ Port 80

‒ Load balancer (round robin)

• Failed action: the proxy is killed with kill -9, and the failure is detected by the monitor action

Failed Action

History Explorer

Pitfalls

Too many logs

• History explorer can get slow

‒ Run HAWK in offline mode to avoid burdening cluster

• Find the relevant transitions

• Narrow the scope

• Command line:

‒ timeframe <from> <to>
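The same narrowing can be applied to raw log files outside the history explorer. The sketch below keeps only the lines whose leading timestamp falls inside a window. It assumes every relevant line starts with an ISO 8601 timestamp including a UTC offset, as in the sample log from this talk. The function name `within_timeframe` is hypothetical and is not part of crmsh.

```python
from datetime import datetime


def within_timeframe(lines, start: datetime, end: datetime):
    """Keep lines whose leading ISO 8601 timestamp lies in [start, end]."""
    selected = []
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if start <= when <= end:
            selected.append(line)
    return selected


log = [
    "2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE",
    "2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header",
    "2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE",
]
start = datetime.fromisoformat("2015-10-11T19:40:15+02:00")
end = datetime.fromisoformat("2015-10-11T19:40:20+02:00")
for line in within_timeframe(log, start, end):
    print(line)
```

The window boundaries must carry a UTC offset as well, since Python refuses to compare timezone-aware and naive timestamps.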

End of the tracks

• Analysing action failure

‒ Example: monitor fails for unknown reasons

‒ Probes

‒ Before starting a resource, Pacemaker checks if it is running

‒ Success Is Failure

• Know your application

‒ Start at action failure, read application logs backwards

‒ At this point, the cluster can't help you

General Confusion

• Which node wrote this log?

‒ Was it even running the resource in question?

• Get back to the evidence

‒ If in doubt, start over

• Cancelled Transitions

‒ Sometimes, the history explorer gets confused

‒ Fencing can cancel a transition

‒ By default, Pacemaker fences offline nodes at startup

Possible Problems

• Network Latency

‒ Does your network fulfill the requirements?

• Disk is full

• Misconfiguration

‒ Use csync2 or configuration management tool

• Fencing device failure

‒ Is fencing enabled?

‒ Does the fencing device work?

‒ Use SBD

Resource tracing

• crm resource trace <resource>

• /var/lib/heartbeat/trace_ra/<agent>/

• Note: Trace is written on node where resource runs

• Complete trace of every action

‒ Can be a lot of data: remember to untrace!

Summary

• Try The New Hawk

• Use The History Explorer

• Follow The Evidence

‒ Action Failure Leads To Actions

‒ Node Failure Leads To Fencing

‒ Without Fencing, Anything Can Happen

Open Source

https://github.com/ClusterLabs/hawk

https://github.com/ClusterLabs/crmsh

Thank you.

Questions?

www.suse.com

Unpublished Work of SUSE LLC. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
