Exploring History with Hawk - SUSECON · 2020. 7. 2.

Exploring History with Hawk

An Introduction to Cluster Forensics

Kristoffer Grönlund

High Availability Software Developer

kgronlund@suse.com

This tutorial

• High Availability in 5 minutes

• Introduction to HAWK

‒ What's new in HAWK 2

• History Explorer

‒ Cluster Forensics

‒ Example Usage

• Summary

About me

• Kristoffer Grönlund

‒ Developer

‒ crmsh

‒ hawk

‒ resource-agents

‒ Maintainer

‒ fence-agents

‒ haproxy

High Availability

What is a cluster?

• Cluster → 1 - 32* Nodes

• Node → Single machine in cluster

‒ Hardware or virtualized

‒ Remote nodes

• Site → Physical location

‒ Local

‒ Metro

‒ Geographical

* Scale beyond 32 nodes with remote nodes

Resources

• Agent Classes

‒ Open Cluster Framework (OCF) Agents

‒ resource-agents

‒ systemd services

‒ Fencing agents

‒ Init scripts

• Examples:

‒ Web Server, File Server

‒ Databases

‒ Filesystems, IP Addresses

‒ VMs, resources in VMs...

Constraints

• Order

‒ Start resource A before resource B

• Location

‒ Resource A prefers node

• Colocation

‒ Resource A with resource B

• Score

‒ Mandatory vs. Preference

‒ Numeric value or +/- infinity

‒ Resource stickiness
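The score rules above can be sketched in a few lines of Python. This is only an illustration, assuming Pacemaker's documented convention that infinity is represented as 1,000,000 and that -INFINITY wins when it is combined with +INFINITY. The function name `add_scores` and the sample values are hypothetical and not part of any cluster tool.

```python
INFINITY = 1_000_000  # assumption: Pacemaker's documented numeric value for "infinity"


def add_scores(a: int, b: int) -> int:
    """Combine two constraint scores, saturating at +/- INFINITY."""
    # A mandatory "never" (-INFINITY) overrides everything, including +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    # A mandatory "always" (+INFINITY) overrides any finite preference.
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite preferences simply add up, clamped to the infinite bounds.
    return max(-INFINITY, min(INFINITY, a + b))


# Illustrative values: a location preference combined with a stickiness bonus.
preference = add_scores(200, 50)
# A mandatory ban always wins over a preference.
banned = add_scores(200, -INFINITY)
print(preference, banned)
```

Finite scores express preferences that can be outweighed by other scores, while an infinite score makes the constraint mandatory.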

Overview

[Architecture diagram. Layer labels: Corosync (Messaging / Infrastructure), Pacemaker (Resource Allocation), Resources (Resource Agents). Component labels: Designated Coordinator (DC), Cluster Resource Manager, Policy Engine, Cluster Information Base (CIB), CIB Replica, Local Resource Manager, Resource.]

Fencing

• Dealing with Schrödinger's cat

• Goal: Preventing corruption

• Storage based: SBD

‒ Recommended if possible

‒ No special hardware required

• Hardware based: IPMI, iLO, …

‒ Many supported devices

Tools

• crmsh

‒ Command line interface

• HAWK

‒ Web interface

Learn more

• www.suse.com/documentation/sle-ha-12/

• Two node cluster in two commands

node1 # ha-cluster-init

node2 # ha-cluster-join -c node1

Introducing HAWK

HAWK - Overview

• “High Availability Web Konsole”

• Monitoring

• Configuration / Administration

• Dashboard

HAWK - Technical details

• Installed by ha-cluster-bootstrap

• Runs on the cluster nodes

• Ruby on Rails

• https://<node>:7630/

HAWK - Security

• Default user is hacluster

‒ Remember to change the password

• HTTPS for secure access

• Replace SSL certificate with your own

‒ /etc/hawk/hawk.key

‒ /etc/hawk/hawk.pem

HAWK 0.7

Status

Dashboard

HAWK 2

A New Look

• Complete visual overhaul

‒ More intuitive

‒ Similar to other SUSE tools

• Improved features

‒ History Explorer

‒ More powerful wizards

‒ Integrated help

• Supports new cluster features

Upgrading to HAWK 2

zypper install hawk2

Login

Status

Dashboard

Graph

Simulator

Simulator, node event

Simulator, results

Creating resources

Command log

Wizards

• Apply a complete cluster configuration

• Helps configure constraints and groups

• Install and configure required software

Wizard, configuration

Wizard, verify changes

Wizard, advanced options

Wizard, optional steps

Wizard, verify changes (1)

Wizard, verify changes (2)

Command line wizards

crm script

list

show virtual-ip

verify virtual-ip id=admin-ip ip=10.13.37.42

run virtual-ip id=...

History Explorer

Cluster Forensics

• Something went wrong

‒ How can we figure it out?

‒ Pitfalls

• Understanding the cluster logs

‒ Use the history explorer

‒ Get a cluster report

Root Cause Analysis

• Start at the evidence

• Trace backwards

• Know the application

• Assume you know nothing

Jumping To Conclusions

• Always stay on the evidence

• When the evidence runs out, we are guessing

• Guessing is OK!

‒ But know when you are guessing

The Evidence

• Failed Cluster Action

‒ Software bugs, crashes

‒ Configuration error

• Failed Node

‒ Hardware failure

‒ Communication error

Collecting data

crm report -f '2015-10-10 12:00' -t '2015-10-10 14:00' strange_event
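When the exact time of an incident is known, the time window for the report can be derived from it. The following is a minimal sketch that only builds the argument list in the same form as the command above. The helper name `report_command` and the one-hour margins are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List


def report_command(incident: datetime, before: timedelta,
                   after: timedelta, name: str) -> List[str]:
    """Build a 'crm report' argument list covering a window around an incident."""
    fmt = "%Y-%m-%d %H:%M"  # same timestamp format as in the example above
    start = (incident - before).strftime(fmt)
    end = (incident + after).strftime(fmt)
    return ["crm", "report", "-f", start, "-t", end, name]


cmd = report_command(
    datetime(2015, 10, 10, 13, 0),   # when the strange event was noticed
    timedelta(hours=1),              # hypothetical margin before the event
    timedelta(hours=1),              # hypothetical margin after the event
    "strange_event",
)
print(cmd)
```

On a cluster node, the resulting list could be passed to `subprocess.run(cmd, check=True)`.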

Understanding the logs

2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header at http://localhost:8000
2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr#011(Started sle12sp1b)
2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530055+02:00 sle12sp1a crmd[1590]: notice: Processing graph 156 (ref=pe_calc-dc-1444585224-290) derived from /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b
2015-10-11T19:40:24.740118+02:00 sle12sp1a crmd[1590]: notice: Initiating action 6: start admin_addr_start_0 on sle12sp1b
2015-10-11T19:40:24.801183+02:00 sle12sp1a crmd[1590]: notice: Initiating action 1: monitor admin_addr_monitor_10000 on sle12sp1b
2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-55.bz2): Complete
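Each log entry follows the pattern timestamp, host, process[pid], message. A small parser makes that structure explicit. This is a sketch based only on the excerpt above: the regular expressions and the helper name `parse_line` are assumptions, and other syslog configurations may format lines differently.

```python
import re
from datetime import datetime

# timestamp, host, process name with optional [pid], then the message
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<proc>[^\[:]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)
# crmd messages announcing an action, as seen in the excerpt
ACTION_RE = re.compile(r"Initiating action (\d+): (\w+) (\S+) on (\S+)")


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.fromisoformat(entry["ts"])
    return entry


sample = [
    "2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: "
    "Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2",
    "2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: "
    "Initiating action 16: stop admin_addr_stop_0 on sle12sp1b",
    "2015-10-11T19:40:24.801183+02:00 sle12sp1a crmd[1590]: notice: "
    "Initiating action 1: monitor admin_addr_monitor_10000 on sle12sp1b",
]

entries = [parse_line(raw) for raw in sample]
for entry in entries:
    action = ACTION_RE.search(entry["msg"])
    if action:
        number, operation, key, node = action.groups()
        print(entry["ts"].time(), f"action {number}: {operation} {key} on {node}")
```

Reading the excerpt this way shows the sequence: the policy engine calculates transition 156, then crmd initiates the stop, start and monitor actions on sle12sp1b.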

Internal components

• Cluster Information Base (CIB)

• Cluster Resource Management daemon (crmd)

• Local Resource Management daemon (lrmd)

• Policy Engine (pengine)

• Fencing daemon (stonithd)

Policy Engine

• Designated Controller (DC)

‒ Elected automatically

‒ Calculates ideal cluster state

‒ Decides on actions to achieve state

Transition

• Sequence of actions to reach new state

• Records state before and after transition

• Saved to /var/lib/pacemaker/pengine/

• Numbered with sequence number

‒ Number sequence may reset to 0 if DC is re-elected
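Because the sequence number may reset, sorting the saved transitions by number alone can give the wrong chronological order. The sketch below orders them by file modification time instead. It assumes file names of the form `pe-input-<n>.bz2`, as seen in the log excerpt, and runs against a temporary directory with made-up files rather than the real `/var/lib/pacemaker/pengine/`. The helper name `list_transitions` is hypothetical.

```python
import os
import re
import tempfile
from pathlib import Path

PE_FILE_RE = re.compile(r"^pe-input-(\d+)\.bz2$")


def list_transitions(directory):
    """Return (mtime, sequence number, file name) tuples, oldest first."""
    found = []
    for path in Path(directory).iterdir():
        match = PE_FILE_RE.match(path.name)
        if match:
            found.append((path.stat().st_mtime, int(match.group(1)), path.name))
    return sorted(found)


# Demonstration: number 0 is newer than number 55, as after a DC re-election.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("pe-input-55.bz2", 1_000), ("pe-input-0.bz2", 2_000)]:
        path = Path(tmp, name)
        path.touch()
        os.utime(path, (mtime, mtime))  # set a fake modification time
    order = [name for _, _, name in list_transitions(tmp)]

print(order)
```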

Cluster Actions

• <resource>_<action>_<nn>

• Actions

‒ start

‒ stop

‒ promote

‒ demote

‒ monitor

‒ migrate_to

‒ migrate_from
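Splitting an action key back into its parts is slightly tricky because both resource names (`admin_addr`) and action names (`migrate_to`) can contain underscores. The following sketch matches against the list of actions above. In the log excerpt the trailing number looks like the operation interval in milliseconds (10000 for the recurring monitor, 0 for one-off actions), but treat that reading as an assumption. `parse_action_key` is a hypothetical helper.

```python
# Longer names first, so "migrate_to" is never mistaken for a resource suffix.
ACTIONS = ("migrate_from", "migrate_to", "start", "stop",
           "promote", "demote", "monitor")


def parse_action_key(key):
    """Split '<resource>_<action>_<nn>' into (resource, action, nn)."""
    head, _, number = key.rpartition("_")
    if not head or not number.isdigit():
        raise ValueError(f"not an action key: {key!r}")
    for action in ACTIONS:
        suffix = "_" + action
        # The resource name must be non-empty, hence the length check.
        if head.endswith(suffix) and len(head) > len(suffix):
            return head[: -len(suffix)], action, int(number)
    raise ValueError(f"unknown action in key: {key!r}")


print(parse_action_key("admin_addr_monitor_10000"))
print(parse_action_key("admin_addr_stop_0"))
```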

Cluster Actions

• Error Codes

0: Success

1: Generic Error

2: Argument Error

3: Unimplemented Action

4: Insufficient Permissions

5: Required Component Is Missing

6: Configuration Error

7: Resource Was Not Running

8: Running As Primary

9: Failed As Primary
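When reading logs, it helps to translate these numeric codes back into text. A minimal lookup table, using the descriptions from the list above; the enum and function names are chosen for illustration and are not the identifiers used by the resource agents themselves.

```python
from enum import IntEnum


class ActionResult(IntEnum):
    """Return codes of cluster actions, as listed above."""
    SUCCESS = 0
    GENERIC_ERROR = 1
    ARGUMENT_ERROR = 2
    UNIMPLEMENTED_ACTION = 3
    INSUFFICIENT_PERMISSIONS = 4
    REQUIRED_COMPONENT_IS_MISSING = 5
    CONFIGURATION_ERROR = 6
    RESOURCE_WAS_NOT_RUNNING = 7
    RUNNING_AS_PRIMARY = 8
    FAILED_AS_PRIMARY = 9


def describe(rc):
    """Return a readable description for a numeric return code."""
    try:
        return ActionResult(rc).name.replace("_", " ").title()
    except ValueError:
        return f"Unknown return code {rc}"


print(describe(7))
print(describe(42))
```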

Cluster Action Failure

• Unexpected result when performing action

• Triggers transition

• May also trigger fencing (stop failure)

Node Failure

• Quorum = Majority vote

‒ Improves availability

‒ Avoids fence loops

‒ Downside: Need more nodes

• Smaller partitions are fenced
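The majority rule can be written down as a one-line check. This sketch assumes one vote per node and ignores special configurations such as two-node clusters; `has_quorum` is a hypothetical name.

```python
def has_quorum(partition_size: int, total_nodes: int) -> bool:
    """A partition has quorum when it holds a strict majority of all votes."""
    return 2 * partition_size > total_nodes


# A five-node cluster split 3/2: only the larger partition keeps quorum.
print(has_quorum(3, 5), has_quorum(2, 5))
# A four-node cluster split 2/2: neither side has a majority.
print(has_quorum(2, 4))
```

The even split shows the downside mentioned above: with an even number of nodes, a symmetric partition leaves no side with quorum.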

Node Failure

• Crash / reboot

• Network issues

• Leads to chaos without fencing

‒ Cluster no longer knows if node is running resources

• Uncommunicative nodes are fenced

‒ Enforces a known state

History Explorer

• Command line:

‒ crm history

• Collect logs from cluster nodes

• Analyse transitions

• Present summary of events

• View configuration

• Transition graph

• Transition diff

• Extract logs during a particular transition

Example configuration

[Diagram labels: demo-node1, demo-node2, srv1, srv2, 200, 200, g-proxy, proxy, proxy-vip, ping, 50]

Example Description

• Two web servers

‒ Port 8000

• HAProxy

‒ Port 80

‒ Load balancer (round robin)

• Failed action: proxy killed with kill -9, detected by monitor

Failed Action

Pitfalls

Too many logs

• History explorer can get slow

‒ Run HAWK in offline mode to avoid burdening cluster

• Find the relevant transitions

• Narrow the scope

• Command line:

‒ timeframe <from> <to>
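The same narrowing can be applied to raw log lines outside the history explorer. A minimal sketch, assuming each line starts with an ISO 8601 timestamp as in the log excerpt shown earlier; the helper name `in_timeframe`, the shortened sample lines and the chosen window are illustrative only.

```python
from datetime import datetime


def in_timeframe(line, start, end):
    """Return True if the timestamp at the start of the line is within [start, end]."""
    stamp = datetime.fromisoformat(line.split(" ", 1)[0])
    return start <= stamp <= end


lines = [
    "2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition",
    "2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr",
    "2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156",
]

start = datetime.fromisoformat("2015-10-11T19:40:20+02:00")
end = datetime.fromisoformat("2015-10-11T19:40:30+02:00")
relevant = [line for line in lines if in_timeframe(line, start, end)]
print(len(relevant), "of", len(lines), "lines fall inside the window")
```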

End of the tracks

• Analysing action failure

‒ Example: monitor fails for unknown reasons

‒ Probes

‒ Before starting a resource, Pacemaker checks if it is running

‒ Success Is Failure

• Know your application

‒ Start at action failure, read application logs backwards

‒ At this point, the cluster can't help you
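One way to read "Success Is Failure": a probe runs before the resource is started, so the cluster expects the answer "not running" (7). If the probe reports success (0) instead, the resource is active somewhere it should not be, and the result counts as unexpected. The sketch below encodes that reading, which is an interpretation of the slide rather than a statement of Pacemaker's exact logic. The function name is hypothetical and the two codes come from the error code list earlier in the deck.

```python
SUCCESS = 0       # "0: Success"
NOT_RUNNING = 7   # "7: Resource Was Not Running"


def probe_is_unexpected(rc: int, expected_running: bool) -> bool:
    """Compare a probe result with what the cluster expects on this node."""
    expected = SUCCESS if expected_running else NOT_RUNNING
    return rc != expected


# The resource should be stopped here, but the probe found it running.
print(probe_is_unexpected(SUCCESS, expected_running=False))
# The resource should be stopped here, and it is.
print(probe_is_unexpected(NOT_RUNNING, expected_running=False))
```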

General Confusion

• Which node wrote this log?

‒ Was it even running the resource in question?

• Get back to the evidence

‒ If in doubt, start over

• Cancelled Transitions

‒ Sometimes, the history explorer gets confused

‒ Fencing can cancel a transition

‒ By default, Pacemaker fences offline nodes at startup

Possible Problems

• Network Latency

‒ Does your network fulfill the requirements?

• Disk is full

• Misconfiguration

‒ Use csync2 or configuration management tool

• Fencing device failure

‒ Is fencing enabled?

‒ Does the fencing device work?

‒ Use SBD

Resource tracing

• crm resource trace <resource>

• /var/lib/heartbeat/trace_ra/<agent>/

• Note: Trace is written on node where resource runs

• Complete trace of every action

‒ Can be a lot of data: remember to untrace!

Summary

• Try The New Hawk

• Use The History Explorer

• Follow The Evidence

‒ Action Failure Leads To Actions

‒ Node Failure Leads To Fencing

‒ Without Fencing, Anything Can Happen

Open Source

https://github.com/ClusterLabs/hawk

https://github.com/ClusterLabs/crmsh

Thank you.

Questions?

www.suse.com

Unpublished Work of SUSE LLC. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.
