High Availability Clustering with Pacemaker and DRBD
Dan Frîncu


Page 1: Pacemaker+DRBD

High Availability Clustering with Pacemaker and DRBD

Dan Frîncu

Page 2: Pacemaker+DRBD

2

Dan Frîncu @ previous experience

• Have been working with clustering technologies for 3 years

• The first 2 years were spent

– Migrating cluster stack to Pacemaker, OpenAIS/Corosync, DRBD

– Delivering training to Product Managers, Sales, Delivery Engineers and Support teams

– Integrating the company's software products with this cluster stack and with other cluster technologies

– Performance testing, hardware benchmarks, designing cluster solutions for the company's clients (RFI, RFP), writing documentation, packaging, and deploying solutions remotely

Page 3: Pacemaker+DRBD

3

Dan Frîncu @ 1&1 Internet Development

• Co-developer of the LinuxDesktop project

• Responsible for IT Operations on the LinuxDesktop project

• The backend for LinuxDesktop is a cluster running on Pacemaker, Corosync & DRBD

• LinuxDesktop is a custom-built GNU/Linux operating system developed for 1&1 employees

Page 4: Pacemaker+DRBD

4

• Clustering – Introduction

• High Availability Clustering – A historical background and future endeavors

• Cluster components – Tools of the trade

• Clustering scenarios – Fitting the needs

• Resource agents – Controlling cluster services

• Demo

• Q&A

Page 5: Pacemaker+DRBD

5

• Generic types of clusters

– HA – High Availability (a.k.a. failover clusters)

• Failover/Failback

• Load Balancing

– HPC – High Performance Computing

• Parallel Programming

• Distributed Computing

Clustering – Introduction

Page 6: Pacemaker+DRBD

6

• Why do I need a cluster?

– “There's no Upside to Downtime” (SAForum.org)

– Hardware redundancy does not account for software bugs, human error or gremlins chewing on the cables

– I'm a developer working on an application, do I need a cluster?

– It depends on what your requirements are!

– Most developers don't have full access to the backend they work on

– When there is an issue, detection of the fault may be automatic, but recovery is done through human intervention, which can take time

Clustering – Introduction

Page 7: Pacemaker+DRBD

7

• HA Clusters – How low can you go?

– The minimum number of nodes for an HA cluster is 2

– There is no theoretical upper limit to the number of nodes, but HA clusters usually span 2-32 nodes

– If you need more than 32 nodes in one cluster, you probably need to rethink its design; HPC may be a better fit

– Default setups can go up to 8-10 nodes without any specific tweaks

– Going above 10 nodes requires taking into consideration delays, mostly network related (STP convergence, multicast groups join/part, etc.)

Clustering – Introduction

Page 8: Pacemaker+DRBD

8

• HA Clusters – What can 2 nodes do?

– The most common size for HA clusters is 2 nodes

– Active/Passive – Applications run on one node, if it fails the other node takes over and starts all apps on it

– Active/Active – Applications run on all/a subset of all nodes (usually resources must be either stateless or depend on a shared storage to work)

• Again, minimum number of nodes for both Active/Passive and Active/Active is 2

• Shared storage can be a dedicated SAN or be easily achieved through use of DRBD

Clustering – Introduction

Page 9: Pacemaker+DRBD

9

• HA Clusters – Beyond 2 nodes?

– N+1 – N nodes with 1 backup node; applications run on any of the N nodes, and if one of them fails, the backup node takes over its services

– N+M – N nodes with M backups; the number of backups is usually derived from the expected hardware failure ratio for a specific service (e.g. 4:1, 7:2, etc.)

– N-to-1 – A variation of N+1 that does the same thing, but only for a limited timeframe; once the failed node is restored, the service fails back to it

– N-to-N – N+M meets Active/Active

Clustering – Introduction

Page 10: Pacemaker+DRBD

10

• Clustering - Introduction

• High Availability Clustering – A historical background and future endeavors

• Cluster components – Tools of the trade

• Clustering scenarios – Fitting the needs

• Resource agents – Controlling cluster services

• Demo

• Q&A

Page 11: Pacemaker+DRBD

11

• Once upon a time, there was Heartbeat

• The main software that came out of the Linux-HA project

• Heartbeat v1

– Was limited to two nodes

– Supported only very simple failover

– Had no resource monitoring (external monitoring was required)

High Availability Clustering – A historical background and future endeavors

Page 12: Pacemaker+DRBD

12

• Heartbeat v2

– Added support for n-node clusters

– Resource monitoring

– Dependencies

– Policies

High Availability Clustering – A historical background and future endeavors

Page 13: Pacemaker+DRBD

13

• Heartbeat v2.1.4 – A fork in the road

– The Cluster Resource Manager was split off into an independent project – Pacemaker

– Resource Agents and Cluster-Glue moved into separate packages

– From this point forward, the Heartbeat name refers only to the cluster messaging and membership layer

High Availability Clustering – A historical background and future endeavors

Page 14: Pacemaker+DRBD

14

• Heartbeat v3 – Under new leadership

– Since January 2010, Heartbeat code base development has been done by LINBIT (which also develops DRBD)

• LINBIT announced it has

– “no intention to add significant features to the Heartbeat code base, or extend its functionality significantly”

– “no intention to establish the Heartbeat code base as a long-term alternative or competition to the OpenAIS/Corosync cluster messaging layer”

High Availability Clustering – A historical background and future endeavors

Page 15: Pacemaker+DRBD

15

• Heartbeat v3 – A glance into the future

– Why continue to use Heartbeat?

• It works!

• Simple configuration! (a minimal ha.cf sketch follows below)

• People don't like change™

– There are two sides to every story

• No hard upper limit on node count, but the cluster cannot grow beyond a maximum message size of <64 kB (roughly 16 hosts)

• No support for cluster filesystems (GFS2, OCFS2, CLVM2, etc.)

• No new features to be developed

High Availability Clustering – A historical background and future endeavors
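To illustrate the “simple configuration” point above, here is a minimal sketch of a two-node Heartbeat setup handing resource management to Pacemaker; the node names and interface are assumptions, and /etc/ha.d/authkeys must also be populated on both nodes.

    # /etc/ha.d/ha.cf -- minimal two-node sketch (names and interface are assumptions)
    autojoin none          # membership limited to the nodes listed below
    bcast eth0             # heartbeat messages over broadcast on eth0
    keepalive 2            # heartbeat interval in seconds
    deadtime 15            # declare a peer dead after 15 seconds of silence
    node node1 node2       # cluster members (must match uname -n)
    crm respawn            # hand resource management over to Pacemaker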

Page 16: Pacemaker+DRBD

16

• OpenAIS/Corosync – The story begins

– The Service Availability Forum (HP, Oracle, Ericsson, a.o.) defined the Application Interface Specification (AIS), an API designed to provide interoperable HA services, from which the OpenAIS project began its life

– In 2008, OpenAIS (an OSI-certified implementation of the AIS spec) was split into 2 projects: Corosync and OpenAIS

– Corosync provides cluster messaging & membership

– OpenAIS provides the rest of the AIS spec (as a plugin to Corosync)

High Availability Clustering – A historical background and future endeavors

Page 17: Pacemaker+DRBD

17

• Corosync – The basics

– It's a cluster messaging and membership layer providing reliable communications between nodes

– Supports multiple transports, such as unicast, multicast, broadcast, as well as InfiniBand

– Supports clustered filesystems (GFS2, OCFS2, CLVM2, etc.)

– Configurable maximum message size (1MB by default) which means it can scale to more nodes and resources per node than Heartbeat

– Redundant self-recovering communication rings (starting with version 1.4.0); a corosync.conf sketch with two rings follows below

High Availability Clustering – A historical background and future endeavors
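As a rough sketch of the points above, a Corosync 1.x configuration along these lines defines two redundant rings and loads Pacemaker as a plugin; the network addresses and ports are assumptions.

    # /etc/corosync/corosync.conf (excerpt) -- illustrative values only
    totem {
        version: 2
        rrp_mode: passive              # redundant ring protocol across two rings
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0   # first (primary) network
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.0.0.0      # second (backup) network
            mcastaddr: 226.94.1.2
            mcastport: 5407
        }
    }
    service {
        name: pacemaker                # start Pacemaker on top of Corosync
        ver: 0
    }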

Page 18: Pacemaker+DRBD

18

• Corosync – The basics

– Used by Red Hat as the only cluster stack for Pacemaker starting with RHEL 6

– Used as High Availability framework by Pacemaker and Apache Qpid

– Used as communications layer by Sheepdog, Proxmox VE (v2.0) and Openfiler (v2.99)

– Runs on all major GNU/Linux distros: SLES, RHEL, Ubuntu, Debian, Fedora, Gentoo

High Availability Clustering – A historical background and future endeavors

Page 19: Pacemaker+DRBD

19

• Pacemaker – The road ahead

– It is a Cluster Resource Manager

– Detects and recovers from node and resource-level failures

– Supports both Corosync and Heartbeat stacks

– Resource agnostic

– Supports STONITH for ensuring data integrity

– Automatically replicated configuration

– Python-based unified, scriptable, cluster shell

• Validation of input prior to commit

• Syntax highlighting

High Availability Clustering – A historical background and future endeavors

Page 20: Pacemaker+DRBD

20

• Pacemaker – The road ahead

– Tool for making offline configuration changes

– Trigger recurring actions at known times (cron-like or based on date comparisons – gt, lt, in-range)

– RelaxNG-based configuration schema

– Connecting to the CIB from non-cluster machines

– Supports cluster-wide service ordering, colocation and anti-colocation

– Supports advanced services

• Clones: services that need to run on N nodes

• Multi-state: Master/Slave, Primary/Secondary (see the crm shell sketch below)

High Availability Clustering – A historical background and future endeavors
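A sketch of what the scriptable crm shell configuration looks like for an Active/Passive DRBD-backed service, tying together the multi-state, ordering and colocation features listed above; all resource names, devices and addresses are assumptions.

    # crm configure (excerpt) -- illustrative only
    primitive p_drbd ocf:linbit:drbd params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
    ms ms_drbd p_drbd meta master-max="1" clone-max="2" notify="true"
    primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/srv/data" fstype="ext4"
    primitive p_ip ocf:heartbeat:IPaddr2 params ip="192.168.1.100" cidr_netmask="24"
    group g_service p_fs p_ip
    colocation col_service_on_master inf: g_service ms_drbd:Master
    order ord_promote_before_start inf: ms_drbd:promote g_service:start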

Page 21: Pacemaker+DRBD

21

• Pacemaker – Future developments

– Given the possible limitations on the number of nodes/resources within a single Pacemaker cluster (imposed not by Pacemaker itself but by the underlying messaging stack), scalability to thousands of nodes is in question

– To address it, development of Pacemaker Cloud has already begun

– Pacemaker Cloud provides high levels of service availability for high scale cloud deployments

– Reuses PEngine library from Pacemaker

High Availability Clustering – A historical background and future endeavors

Page 22: Pacemaker+DRBD

22

• Pacemaker – Future developments

– Integrates with several technologies

• Matahari – A stripped-down version of Pacemaker suited to running inside VMs

• DeltaCloud – An API that abstracts the differences between cloud providers, preventing vendor lock-in

– Project under development, not yet ready for mainstream use

High Availability Clustering – A historical background and future endeavors

Page 23: Pacemaker+DRBD

23

• Pacemaker – Future developments

– Stretch clusters (multi-site clusters/clusters of clusters) were discussed as being under development

• On December 4th, the Booth cluster ticket manager was launched

• Multi-site clusters can be considered as “overlay” clusters where each cluster site corresponds to a cluster node in a traditional cluster

– Scalability is thus addressed both in the short term and in the long term

High Availability Clustering – A historical background and future endeavors

Page 24: Pacemaker+DRBD

24

• DRBD – Shared storage made easy

– Spans 2 cluster nodes (Master/Slave or Master/Master)

– All write I/O synchronously replicated to other node

– Also considered to be a Network-based RAID-1

– Widely used in the industry, including at 1&1, so most of its features are well known

• Stacked resources, which allow 3-way and even 4-way replication (as of version 8.3); a basic resource configuration sketch follows below

• Adaptive dynamic resync rate controller (starting with 8.3.11)

High Availability Clustering – A historical background and future endeavors
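A minimal sketch of a DRBD resource definition matching the description above (protocol C gives synchronous replication); host names, backing devices and addresses are assumptions.

    # /etc/drbd.d/r0.res -- illustrative values only
    resource r0 {
        protocol C;                       # synchronous replication: writes confirmed by both nodes
        on node1 {
            device    /dev/drbd0;
            disk      /dev/sdb1;          # backing device on node1
            address   192.168.1.1:7789;
            meta-disk internal;
        }
        on node2 {
            device    /dev/drbd0;
            disk      /dev/sdb1;          # backing device on node2
            address   192.168.1.2:7789;
            meta-disk internal;
        }
    }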

Page 25: Pacemaker+DRBD

25

• DRBD – Shared storage made easy

• The multi-volume feature allows several minor devices to be used “within the same resource”

• Planned features include development of a “full data log” (it may have another name when released), which would allow the Secondary to remain consistent even after replication link hiccups or a fallback to bitmap

High Availability Clustering – A historical background and future endeavors

Page 26: Pacemaker+DRBD

26

• Clustering – Introduction

• High Availability Clustering – A historical background and future endeavors

• Cluster components – Tools of the trade

• Clustering scenarios – Fitting the needs

• Resource agents – Controlling cluster services

• Demo

• Q&A

Page 27: Pacemaker+DRBD

27

Cluster components – Tools of the trade

Page 28: Pacemaker+DRBD

28

• Pacemaker's internal components

– CRMd – Cluster Resource Manager daemon (a message broker between the PEngine and the LRMd)

– LRMd – Local Resource Manager daemon (non-cluster aware daemon that interacts with resource agents – scripts – directly)

– PEngine – Policy Engine (the “brain”, computes the next state of the cluster based on current state + conf)

– CIB – Cluster Information Base (contains all cluster information, synchronizes updates to all nodes; see the inspection commands below)

– STONITHd – Shoot-The-Other-Node-In-The-Head Daemon (a subsystem for node fencing)

Cluster components – Tools of the trade
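A few commands, as a sketch, for looking at these components on a running cluster:

    crm_mon -1          # one-shot view of node membership and resource status
    cibadmin --query    # dump the current CIB as XML
    crm_verify -L -V    # have the Policy Engine check the live configuration for errors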

Page 29: Pacemaker+DRBD

29

Cluster components – Tools of the trade

Page 30: Pacemaker+DRBD

30

• Clustering – Introduction

• High Availability Clustering – A historical background and future endeavors

• Cluster components – Tools of the trade

• Clustering scenarios – Fitting the needs

• Resource agents – Controlling cluster services

• Demo

• Q&A

Page 31: Pacemaker+DRBD

31

• Pacemaker can support practically any redundancy configuration, including

– Active/Active

– Active/Passive

– N+1

– N+M

– N-to-1

– N-to-N

Clustering scenarios – Fitting the needs

Page 32: Pacemaker+DRBD

32

Clustering scenarios – Fitting the needs

Page 33: Pacemaker+DRBD

33

Clustering scenarios – Fitting the needs

Page 34: Pacemaker+DRBD

34

Clustering scenarios – Fitting the needs

Page 35: Pacemaker+DRBD

35

Clustering scenarios – Fitting the needs

Page 36: Pacemaker+DRBD

36

Clustering scenarios – Fitting the needs

Page 37: Pacemaker+DRBD

37

• Clustering – Introduction

• High Availability Clustering – A historical background and future endeavors

• Cluster components – Tools of the trade

• Clustering scenarios – Fitting the needs

• Resource agents – Controlling cluster services

• Demo – Don't try this at home

• Q&A

Page 38: Pacemaker+DRBD

38

• Definition: a Resource Agent is a standardized interface for a cluster resource

• Pacemaker supports four types of RAs:

– Heartbeat v1 (legacy, deprecated)

– LSB (Linux Standard Base) “init scripts”

– OCF (Open Cluster Framework)

– STONITH Resource Agents

• Most Resource Agents are coded as shell scripts

Resource agents – Controlling cluster services

Page 39: Pacemaker+DRBD

39

• LSB Resource Agents

– Are the scripts found in /etc/init.d/

– Require LSB compliance in terms of exit codes and arguments for usage within a Pacemaker cluster

– Although many distributions boast LSB-compliant init scripts, they often ship with broken ones

– Broken LSB compliance leads to situations where controlling the service via the init script works, but controlling it via Pacemaker doesn't

• Always check that a script is LSB-compliant before adding it to a Pacemaker cluster (a quick spot-check follows below)

Resource agents – Controlling cluster services
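A quick way to spot-check LSB compliance before handing an init script to Pacemaker, as mentioned above; "myservice" is a placeholder name.

    /etc/init.d/myservice start  ; echo $?   # must return 0
    /etc/init.d/myservice status ; echo $?   # must return 0 while the service is running
    /etc/init.d/myservice start  ; echo $?   # starting an already-started service must still return 0
    /etc/init.d/myservice stop   ; echo $?   # must return 0
    /etc/init.d/myservice status ; echo $?   # must return 3 when the service is stopped
    /etc/init.d/myservice stop   ; echo $?   # stopping an already-stopped service must still return 0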

Page 40: Pacemaker+DRBD

40

• OCF Resource Agents

– Are the scripts found in /usr/lib/ocf/resource.d/provider/

– The OCF spec is an extension of the definitions for LSB Resource Agents

– Require the same LSB compliance in terms of exit codes and arguments for usage within a Pacemaker cluster

– Support additional parameters to be passed to the script

– Support additional actions compared to the LSB Resource Agents

– Can be tested for compliance with ocf-tester (see the example below)

Resource agents – Controlling cluster services
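For example, testing the stock IPaddr2 agent with ocf-tester might look like this; the resource name and IP address are placeholders.

    ocf-tester -n test_ip -o ip=192.168.122.100 \
        /usr/lib/ocf/resource.d/heartbeat/IPaddr2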

Page 41: Pacemaker+DRBD

41

• Supported operations of Resource Agents (a minimal agent skeleton follows after this list)

– start: enable or start the given resource

– stop: disable or stop the given resource

– monitor: check whether the resource is running or not

– validate-all: validate the resource's configuration

– meta-data: return information about the RA itself (used by GUIs and other tools)

Resource agents – Controlling cluster services
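A minimal OCF resource agent skeleton covering the operations listed above; "mydaemon", its binary and pid file are placeholders, and a real agent would emit full XML metadata and validate its parameters.

    #!/bin/sh
    # Minimal OCF resource agent sketch -- not a production agent
    # OCF_ROOT is exported by the cluster (or by ocf-tester)
    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs    # defines the OCF_* return codes

    PIDFILE=/var/run/mydaemon.pid

    mydaemon_monitor() {
        # running if the pid file exists and the process answers signal 0
        [ -f "$PIDFILE" ] && kill -0 "$(cat $PIDFILE)" 2>/dev/null && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }
    mydaemon_start() {
        mydaemon_monitor && return $OCF_SUCCESS      # already running
        /usr/sbin/mydaemon && return $OCF_SUCCESS
        return $OCF_ERR_GENERIC
    }
    mydaemon_stop() {
        mydaemon_monitor || return $OCF_SUCCESS      # already stopped
        kill "$(cat $PIDFILE)" && return $OCF_SUCCESS
        return $OCF_ERR_GENERIC
    }

    case "$1" in
        start)        mydaemon_start ;;
        stop)         mydaemon_stop ;;
        monitor)      mydaemon_monitor ;;
        validate-all) exit $OCF_SUCCESS ;;           # a real agent checks its parameters here
        meta-data)    echo '<resource-agent name="mydaemon"/>' ; exit $OCF_SUCCESS ;;
        *)            exit $OCF_ERR_UNIMPLEMENTED ;;
    esac
    exit $?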

Page 42: Pacemaker+DRBD

42

• Additional operations provided by OCF Resource Agents

– promote: promote the local instance of a resource to the master/primary state

– demote: demote the local instance of a resource to the slave/secondary state

– notify: used by the cluster to send the agent pre- and post-event notifications about actions performed on the resource

– reload: reload the configuration of the resource

– migrate_from/migrate_to: perform live migration of a resource

Resource agents – Controlling cluster services

Page 43: Pacemaker+DRBD

43

• Resource scores

– Every resource has a score, even if not explicitly defined

– The CRM (through the PEngine) uses scores to calculate resource placement across the available cluster nodes

– Every decision about where a resource is placed comes down to assigning and manipulating scores

– Highest score INF (1,000,000), lowest score -INF (-1,000,000); resources can get any score within that range, including INF/-INF

– Positive values mean “can run”, negative values mean “cannot run”; +/- INF change “can” to “must” (see the constraint sketch below)

Resource agents – Controlling cluster services
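Scores are most visible in location constraints; a sketch in crm shell syntax (resource and node names are assumptions), plus a command that prints the allocation scores Pacemaker actually computed.

    # crm configure (excerpt) -- illustrative only
    location loc_prefer_node1 g_service 100: node1    # positive score: "can run", prefer node1
    location loc_ban_node3    g_service -inf: node3   # -INF: "must not run" on node3

    # show the allocation scores computed for the live cluster
    crm_simulate -sL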

Page 44: Pacemaker+DRBD

44

• Clustering – Introduction

• High Availability Clustering – A historical background and future endeavors

• Cluster components – Tools of the trade

• Clustering scenarios – Fitting the needs

• Resource agents – Controlling cluster services

• Demo

• Q&A

Page 45: Pacemaker+DRBD

45

• Clustering – Introduction

• High Availability Clustering – A historical background and future endeavors

• Cluster components – Tools of the trade

• Clustering scenarios – Fitting the needs

• Resource agents – Controlling cluster services

• Demo

• Q&A

Page 46: Pacemaker+DRBD

46

Q&A

Resource agents – Controlling cluster services

Page 47: Pacemaker+DRBD

47

• Useful resources and links

– http://fghaas.wordpress.com/2009/11/16/linbit-announces-stewardship-for-heartbeat-code-base/

– http://www.saforum.org/Application-Interface-Specification~217404~16627.htm

– http://linux-ha.org/wiki/LSB_Resource_Agents

– http://linux-ha.org/wiki/OCF_Resource_Agents

– http://www.openais.org/doku.php

– http://www.corosync.org/doku.php

– http://www.clusterlabs.org

– http://www.drbd.org

Contact