High Availability (HA)
Aidan Finn
About Aidan Finn
• Technical Sales Lead at MicroWarehouse (Dublin)
• Working in IT since 1996
• MVP (Virtual Machine)
• Experienced with Windows Server/Desktop, System Center,
virtualisation, and IT infrastructure
• @joe_elway
• http://www.aidanfinn.com
• http://www.petri.co.il/author/aidan-finn
• Published author/contributor of several books
Books
• System Center 2012 VMM
• Windows Server 2012 Hyper-V
Agenda
• What is HA?
• Managing failure: heartbeats and quorum
• Cluster storage and Cluster Shared Volumes
• Networking
• Building, patching, and managing the cluster
What is HA?
High Availability
• From the Hyper-V perspective, HA is about infrastructure fault
tolerance
• Example:
1. HostA is one of a number of hosts in a cluster
2. Every host in the cluster stores VM files on shared storage, such
as a SAN
3. VM01 is running on HostA
4. HostA stops running
5. VM01 automatically fails over to another host in the cluster
6. VM01 automatically boots up
• There is some downtime for VM01 but it is minimized
• The cluster has acted as a unit to protect against the failure of
HostA
A Typical Hyper-V Cluster
• Two or more hosts
• Each host is connected to a set of networks with special roles
• All hosts are connected to a shared cluster-supported storage system
• All HA VMs are stored on the shared storage
Managing Failure
The Heartbeat
(Diagram: nodes exchange heartbeats: "You there?" / "Yes")
• Failover Clustering conducts health
monitoring between nodes to detect
when servers are no longer available
• When servers are unresponsive, clustering takes recovery action
• Unicast in nature and uses a Request-
Reply type process for reliability and
security
– Not just a basic ping
Failover Cluster Virtual Adapter
• Failover Cluster Virtual Adapter (NetFT) is a virtual network
adapter that builds fault-tolerant TCP connections across all
available interfaces between nodes in the cluster
• NetFT is the mechanism by which clusters use multiple cluster-
enabled adapters to communicate
• Seamless internode communication
– NetFT will dynamically and seamlessly switch cluster
communication to a different network (based on priority)
when a network fails
• Long story short: The cluster can use multiple enabled networks
for cluster communications and is fault tolerant
Heartbeat Detection
• Runs on TCP 3343
• WS2012 R2 Hyper-V clusters:
– Nodes exchange heartbeats every 1 second
– Will allow for failure for up to 10 seconds (5 on non-Hyper-V clusters)
for nodes on the same subnet
– Will allow for failure for up to 20 seconds (5 on non-Hyper-V clusters)
for nodes on different subnets
• No response after that threshold – the host is assumed offline
and quorum must be obtained
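These settings are exposed as cluster common properties and can be checked from PowerShell on any node; a minimal sketch:
$cluster = Get-Cluster
$cluster.SameSubnetDelay        # ms between heartbeats, default 1000
$cluster.SameSubnetThreshold    # missed heartbeats tolerated on the same subnet (10 on Hyper-V clusters)
$cluster.CrossSubnetDelay       # ms between heartbeats across subnets
$cluster.CrossSubnetThreshold   # missed heartbeats tolerated across subnets (20 on Hyper-V clusters)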
Quorum
What is Quorum?
• Quorum is when you have enough voters to come to an
agreement
• Primary function of a cluster is to keep mission-critical services
online
• It needs to accomplish this without causing corruption or
confusion
• This is why quorum is used
Explaining Quorum
Quorum Basics
• Sticking with WS2012 R2 to keep this simple
• Two types of vote breaker or witness
– Witness disk
• A 1 GB LUN that is created on the shared storage just for
this purpose
• Configured as a witness disk in the cluster
• Owner of the disk is the vote breaker in case of a tied vote
for quorum
– File Share Witness
• Originally intended for multi-site clusters
• Now used for clusters that use SMB 3.0 for shared
storage
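Either witness type can also be configured from PowerShell with Set-ClusterQuorum; a minimal sketch, using placeholder disk and share names:
Set-ClusterQuorum -DiskWitness "Cluster Disk 1"             # witness disk on shared storage
Set-ClusterQuorum -FileShareWitness "\\FS01\ClusterWitness" # file share witness, e.g. for SMB 3.0 clusters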
Explaining Quorum
Other Quorum Concepts
• Sequential host failure (WS2012)
– Scenario when one host after another after another drops
offline
– Quorum can still be obtained, even if fewer than half the
nodes are online
• Dynamic quorum (WS2012 R2)
– When the cluster rigs the quorum voting process
– Intended to give the cluster a better chance of staying online
– ALWAYS have a quorum witness
• Before WS2012 R2, the guidance was to add a witness only
with an even number of nodes
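A quick way to see dynamic quorum at work is to inspect the vote properties from PowerShell; a minimal sketch:
(Get-Cluster).DynamicQuorum                                  # 1 = dynamic quorum enabled (the default)
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight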
Cluster Storage
Why Storage Is Needed
• The Hyper-V hosts provide HA to VMs
• Each host must have access to the VMs’ storage
• There is no replication from host-to-host inside a cluster
• All VMs are stored on shared storage
• Options include
– SAS storage area network (SAN)
– iSCSI SAN
– Fibre channel SAN
– Fibre Channel over Ethernet (FCoE) SAN
– PCI RAID (WS2012+)
– Storage Spaces (WS2012+)
– SMB 3.0 (WS2012+ file shares)
Connectivity
• Each node is connected to the shared storage
• Exact same connectivity
• Dual path connectivity
– Multipath IO (MPIO) for traditional storage
– SMB Multichannel for SMB 3.0 storage
• All disks/LUNs/shares on the storage are assigned to all nodes
– Each host has equal access
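MPIO must be enabled on every node before the dual paths are claimed; a minimal sketch, assuming iSCSI storage:
Install-WindowsFeature Multipath-IO
Enable-MSDSMAutomaticClaim -BusType iSCSI   # claim iSCSI LUNs for MPIO; a reboot may be required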
Cluster-in-a-Box
• Take the requirements of a cluster
• Put it into a single enclosure
– 2+ blade servers with own
power + networking
– JBOD or PCI RAID shared
storage
– Hard wired cluster networking
• Designed for SME and branch
office
Cluster Shared Volumes
Cluster Shared Volume (CSV)
• Microsoft’s cluster file system
• Makes the volume on the disk active/active across all nodes
• Store lots of VMs on a single volume
– All able to run on any node in the cluster
• Every node connected to the disk can read/write to the volume
• One node owns the volume and is responsible for metadata
operations:
– Owner
– AKA CSV Coordinator
• No drive letter
– Drive is mounted as folder under C:\ClusterStorage on each node
– You can (I usually do) rename that folder, e.g. CSV1
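Converting a clustered disk to a CSV and renaming its mount point takes one line each; a minimal sketch, with a placeholder disk name:
Add-ClusterSharedVolume -Name "Cluster Disk 2"
Rename-Item "C:\ClusterStorage\Volume1" -NewName "CSV1"   # the renamed mount point is visible on every node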
CSV Illustrated
Redirected IO
• A process used by CSV (only)
– Nodes in cluster redirect storage IO to pass over cluster network
via CSV coordinator
– Done on per-CSV basis, not per-cluster
• Used by W2008 R2 CSV for backup
– Caused concern
– Redirected IO NOT USED FOR BACKUP SINCE WS2012
• Redirected IO is used by WS2012 for:
– Very brief metadata operations: permissions, file metadata, file
create, file open, file extend
– Storage path fault tolerance: Node loses direct connection to
storage and redirects via CSV owner to avoid outage
Redirected IO Illustrated
Controlling Redirected IO
• On W2008 R2:
– Redirected IO went across the cluster communications network
– Network with the lowest routing metric (could be manipulated)
• On WS2012 and later:
– Uses SMB 3.0 and SMB Multichannel
– Can flood equal speed networks between nodes if not controlled
– Use SMB Multichannel constraints to select which networks are
used to talk to other cluster nodes
– New-SmbMultichannelConstraint -ServerName Node2,Node3 -InterfaceAlias ClusterNet1,ClusterNet2
CSV Cache
• A read cache for virtual hard disks stored on the CSV
• Uses percentage of cluster node’s RAM for the cache
– Size of cache is set once per cluster
– Boost read performance, e.g. VDI boot storm
– (Get-Cluster).SharedVolumeBlockCacheSizeInMB = 512
• WS2012
– Up to 20% of nodes’ RAM could be assigned to cache
– Enable each required CSV for CSV Cache
– Get-ClusterSharedVolume “Cluster Disk 1” | Set-ClusterParameter CsvEnableBlockCache 1
– Required CSV to be disabled/enabled to start caching
• WS2012 R2
– Up to 80% of nodes’ RAM can be assigned to cache
– CSV Cache enabled by default for each CSV
Other CSV 2.0 Improvements
• WS2012:
– Uses mount point instead of junction point
– Single synchronised VSS Snapshot for backup - no Redirected
IO during backup
– Can enable BitLocker
– NTFS on CSV appears as CSVFS
– Supported for Hyper-V and Scale-Out File Server
• WS2012 R2:
– Supports ReFS file system – I still would not do it yet unless
volumes are huge (no CHKDSK)
– CSV ownership is automatically load balanced across nodes
Networking
Converged Networks
• In W2008 R2 we would have had 1 NIC or NIC team per required
network
– Lots of NICs
– Very expensive to add 10 GbE or faster networking for peak
usage
• Converged networks concept:
– Aggregate fewer NICs into an accumulation of bandwidth
– Divide that bandwidth up using WS2012+ QoS into required
networks
– Makes adopting 10 GbE or faster networking economical for
medium/larger companies
• Much bigger concept than I have time to talk about today. See my
posts on aidanfinn.com and the Petri IT Knowledgebase
Non-Converged with iSCSI
• With SAS/FC SAN: 4 NICs, 8 with NIC teaming
• With iSCSI SAN: 6 NICs, 10 with NIC teaming
• 1-2 more with dedicated backup network
Convergence Using Virtual NICs
(Diagram: two DVMQ-capable pNICs in a NIC team (Hyper-V Port or Dynamic mode) uplink to trunk ports on top-of-rack switches; a virtual switch in the management OS carries Management (VLAN 101), Cluster (VLAN 102), Live Migration (VLAN 103), and Backup (VLAN 104) virtual NICs; separate SAS/FC/iSCSI storage adapters use MPIO on VLANs 201/202)
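The converged design above can be scripted with the Hyper-V PowerShell module; a minimal sketch, where the team, switch, VLAN, and weight values are placeholders taken from the diagram:
New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "Team1" -MinimumBandwidthMode Weight -AllowManagementOS $false
Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "ConvergedSwitch"
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Management" -Access -VlanId 101
Set-VMNetworkAdapter -ManagementOS -Name "Management" -MinimumBandwidthWeight 50
# Repeat Add/Set for the Cluster (VLAN 102), Live Migration (VLAN 103), and Backup (VLAN 104) vNICs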
Convergence with SMB 3.0 Storage
(Diagram: two iWARP rNICs (RSS, DCB) connected to 10 Gbps switches (VLANs 201/202, DCB) carry SMB Direct (W=70), Cluster (W=10), and Backup (W=20) traffic, scoped with an SMB Multichannel constraint; two 1 Gbps NICs in a Dynamic NIC team on trunk-port 1 Gbps switches carry Management (VLAN 101, W=50) in the management OS)
Building a Cluster
Creating a Cluster
• Easier than ever
• Get the pieces right first:
– Storage
– Networking
• Process:
1. Validate the cluster – fix until it passes
2. Deploy the cluster
– Get it right up front and it takes a few minutes
– Possible to automate using PowerShell
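Both the validation and the deployment can be done from PowerShell; a minimal sketch, with placeholder host names and cluster address:
Test-Cluster -Node Host1, Host2                                    # produces a validation report; fix failures before continuing
New-Cluster -Name HVC1 -Node Host1, Host2 -StaticAddress 192.168.1.50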
Finish the Cluster
Completing the Cluster
• Run Windows Update
– To get updates published via Windows Update
• Search for “Recommended update for Windows Server 2012 R2
Failover Clustering”
– To get the bug fixes that are usually not published via Windows
Update
Configure Cluster Networks
• Rename the networks in Failover Cluster Manager
– I name them after the NICs that are on the networks
• Select your Live Migration network(s)
– Check multiple boxes if you elect to use SMB Live Migration
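The rename can also be scripted; a minimal sketch, with placeholder network names:
(Get-ClusterNetwork "Cluster Network 1").Name = "Management"
(Get-ClusterNetwork "Cluster Network 2").Name = "Cluster"
Get-ClusterNetwork | Format-Table Name, Role, Address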
Configure Witness
• Cluster wizard will automatically find a suitable Disk Witness if one is
available
– Make sure you check this
• You will have to add a File Share Witness if using SMB 3.0 storage
Configure Storage
• SMB 3.0 Storage
– Create one share for File Share Witness
– Create one or more shares for storing VMs
– Add all hosts to a security group
– Add all admins to a security group
– Grant full control to the shares
• Disk storage
– Provision 1 * 1 GB disk for disk witness
– Provision 1 or more LUNs per node in the cluster to store VMs
– Connect the disks to all nodes in the cluster
– Initialize (GPT) and format the disks in Disk Management on one node
– Add the disks to the cluster
– Convert storage disks to CSVs
– I rename the mount points to have consistent names
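Most of these steps can be scripted; a minimal sketch, assuming hypothetical share, group, and disk names:
# SMB 3.0: create a VM share and grant full control to the host and admin groups (match the NTFS permissions)
New-SmbShare -Name "VMs1" -Path "E:\Shares\VMs1" -FullAccess "DOMAIN\HyperVHosts", "DOMAIN\HyperVAdmins"
# Disk storage: add the provisioned disks to the cluster, then convert them to CSVs
Get-ClusterAvailableDisk | Add-ClusterDisk
Add-ClusterSharedVolume -Name "Cluster Disk 2"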
Patching a Cluster
• Simple orchestration of cluster node updates
• Determines updates needed, moves workloads off nodes for updates
– Uses Windows Update Agent direct
from Microsoft or from WSUS
– Identifies node with least load
– Puts node in maintenance mode
– Verifies success, then moves to next node
• Maintains service availability without impacting cluster quorum
• Can be:
– Scheduled
– Manually started from remote Failover Cluster Manager console
Cluster Aware Updating
(Diagram: the Update Coordinator orchestrates updates across the cluster nodes one at a time)
Enabling Cluster Self-Updating
• Place all cluster nodes and cluster computer account in an OU for
the cluster
• Delegate rights to cluster CAP
– Create/manage computer objects in this OU
– This is used to create another CAP/computer object for self-
updating CAU
• Launch CAU wizard in Failover Cluster Manager
– Configure Self-Updating job
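Self-updating and a manual run can both be set up from PowerShell; a minimal sketch, with a placeholder cluster name:
Add-CauClusterRole -ClusterName HVC1 -EnableFirewallRules -Force   # enable the self-updating clustered role
Invoke-CauRun -ClusterName HVC1 -Force                             # start an updating run manually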
Managing VMs
Use FCM
• All management of HA VMs is locked out in Hyper-V Manager
• Use Failover Cluster Manager
• You can order failover of VMs using Virtual Machine Priority
(High/Medium/Low)
• You can drain a node of VMs by pausing the host
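Both operations have PowerShell equivalents; a minimal sketch, with placeholder VM and host names:
(Get-ClusterGroup "VM01").Priority = 3000   # 3000 = High, 2000 = Medium, 1000 = Low
Suspend-ClusterNode -Name HostA -Drain      # pause the host and drain its VMs to other nodes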
Backup
• There are products that support a WS2012 R2 Hyper-V cluster
• And then there are products that do it at least decently
• Test & research
• Do not trust sales & marketing
• Many have been stung, especially by:
– Companies known by 2 letters
– Companies that add support 12 months after a server release