Exchange Server 2013 High Availability
Scott Schnoll, Microsoft Corporation
[email protected]
Twitter: @Schnoll
Blog: http://aka.ms/schnoll

Copyright© Microsoft Corporation DAG Architecture


Agenda
• DAG architecture
  • MSExchangeRepl
  • MSExchangeDAGMgmt
  • Cluster
  • Crimson Channel
• Witness Server Placement
• Dynamic Quorum
• DAG Member Maintenance
• Managed Availability

DAG Architecture

DAG Replication Service
Introduced in Exchange 2007 RTM

Microsoft Exchange Replication service | MSExchangeRepl
• MSExchangeRepl.exe
• Runs on all Mailbox servers (not just DAG members)
• Communicates with Active Directory and other DAG members

Includes 16 components
• Active Directory lookup
• Replay RPC server wrapper
• TPR API manager
• Copy status lookup
• Remote data provider wrapper
• Support API manager
• Replay core manager
• VssWriter
• Server locator manager
• Seed manager
• Active Manager
• Health state tracker
• Autoreseed manager
• Active Manager RPC server wrapper
• Disk reclaimer manager
• Failure item manager


DAG Management Service
Introduced in RTM CU2

Microsoft Exchange DAG Management service | MSExchangeDagMgmt
• MSExchangeDagMgmt.exe
• Runs on all Mailbox servers (not just DAG members)
• Communicates with Active Directory and other DAG members

Includes 4 components
• Active Directory lookup
• Copy status lookup
• Monitoring
• Tracer instance


DAG Management Service
Writes events to same place as Replication service
• Application event log (source of MSExchangeRepl)
• HighAvailability crimson channel

Created for two primary reasons:
• so the Replication service can have more focused functionality
• so Managed Availability actions can kill lower-priority activities

As we refactor more, other functions will move to this service
• AutoReseed
• Disk Reclaimer
• Dynamic replay lag playdown
• Future AutoDAG copy layout and mobility features


Cluster Service
Introduced in NT Server Enterprise Edition (1997)

Cluster Service | ClusSvc
• Clussvc.exe

Exchange DAGs use several Cluster components
• Quorum
• Membership and Node Management
• Networks and Heartbeating
• Cluster Registry


Cluster Service
Quorum is required in order to mount databases

Quorum is based on votes, not membership
• Voting can be rigged
• Votes can be taken away manually or dynamically

Exchange manages quorum model, not quorum
• Exchange management of quorum model based on nodes, not votes
• Removing votes requires manual configuration of quorum model
• Exchange will make incorrect quorum model management decisions if votes are manually removed at the cluster level
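
The difference between counting nodes and counting votes can be made concrete with a little arithmetic. The Python sketch below is a simplified model, not Exchange or cluster code. The function names are hypothetical, and it assumes that a DAG with an even number of members uses a witness vote while a DAG with an odd number does not.

```python
def quorum_model(dag_members: int) -> str:
    """Model chosen from the NODE count (assumption: even counts add a witness)."""
    if dag_members < 1:
        raise ValueError("a DAG needs at least one member")
    if dag_members % 2 == 0:
        return "Node and File Share Majority"
    return "Node Majority"


def total_votes(dag_members: int, removed_votes: int = 0) -> int:
    """Votes actually present: node votes minus manual removals, plus witness if used."""
    witness = 1 if dag_members % 2 == 0 else 0
    return dag_members - removed_votes + witness


def votes_required(votes: int) -> int:
    """A strict majority of the votes that exist."""
    return votes // 2 + 1


# Four members, nothing removed: 4 node votes + 1 witness vote = 5, majority is 3.
print(quorum_model(4), total_votes(4), votes_required(total_votes(4)))

# Four members, one vote removed by hand: the model is still picked for 4 nodes,
# but only 4 votes exist, so the vote count is even and a 2-2 tie is possible.
print(quorum_model(4), total_votes(4, removed_votes=1))
```

In this model the second case shows why manual vote removal is risky: the quorum model still follows the node count, while the real vote total has changed underneath it.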


Cluster Registry
Active Manager stores database / server information in the cluster registry for DAG members
• Registry changes are replicated immediately to all DAG members
• Stored information is used as part of Best Copy and Server Selection (BCSS)


Cluster Registry
IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*IsAdminDismounted?False*IsAutomaticActionsAllowed?True*

ActiveServer
• Name of the server where the database is currently mounted or is expected to be mounted when mount operations complete

LastMountedServer
• The name of the server where the database was last successfully mounted

LastMountedTime
• The date and time stamp of the last time the database was mounted


Cluster Registry
MountStatus
• The current mount status for the database
• Possible values are mounted / dismounted

IsAdminDismounted
• Designates whether the current dismounted status of the database is the result of administrator action
• Possible values are True / False

IsAutomaticActionsAllowed
• Designates whether the database can be automatically activated by AM
• Possible values are True / False
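
The registry value shown on these slides is a flat string of `key?value` pairs, each terminated by `*`. As an illustration only, the Python sketch below splits the sample value into a dictionary. The function name is hypothetical, and the delimiter rules are inferred from the one sample on the slide, not from a documented format.

```python
from datetime import datetime


def parse_database_state(raw: str) -> dict:
    """Split 'key?value*key?value*' into a dict (format inferred from the slide sample)."""
    fields = {}
    for pair in raw.strip("*").split("*"):
        key, _, value = pair.partition("?")
        fields[key] = value
    return fields


sample = (
    "IsEntryExist?True*ActiveServer?ex2*LastMountedServer?ex2*"
    "LastMountedTime?2013-07-15T22:29:39*MountStatus?Mounted*"
    "IsAdminDismounted?False*IsAutomaticActionsAllowed?True*"
)

state = parse_database_state(sample)
mounted_at = datetime.fromisoformat(state["LastMountedTime"])

print(state["ActiveServer"], state["MountStatus"], mounted_at.year)
```

All values are kept as strings here; converting `True` / `False` to booleans is left out to avoid guessing at how Active Manager itself interprets them.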


Cluster Registry Last Log

Entry for each database copy in the DAG (named by the database GUID)

Stores the last sequence number of the last generated log (in decimal)
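
Because the value is stored in decimal, comparing it with log files on disk needs a conversion, since transaction log file names carry the generation number in hexadecimal. The Python sketch below shows that conversion. The `E00` prefix and the eight-digit width are assumptions based on typical log naming, and the function name is hypothetical.

```python
def log_file_name(last_log_decimal: int, prefix: str = "E00") -> str:
    """Render a decimal log generation as a hex-numbered log file name.

    Assumption: names are <prefix> + 8 uppercase hex digits + '.log'.
    """
    if last_log_decimal < 0:
        raise ValueError("log generation cannot be negative")
    return f"{prefix}{last_log_decimal:08X}.log"


# Decimal 2607 is hexadecimal A2F.
print(log_file_name(2607))  # E0000000A2F.log
```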


Crimson Channel Applications and Services logs

Area of Windows Server event log used by applications for logging and internal communication

These logs store events from a single application or component rather than events that might have system-wide impact

This is referred to as an application's crimson channel

Exchange 2013 has multiple channels
• ActiveMonitoring
• HighAvailability
• MailboxDatabaseFailureItems
• ManagedAvailability
• PushNotifications
• Troubleshooters


Witness Server Placement


Witness Server
An external voter that adds a tie-breaking vote to a DAG with an even number of members
• Does not contain a full copy of quorum data
• Represented by File Share Witness resource

File share witness cluster resource, directory, and share are automatically created by Exchange when needed and removed by Exchange when not needed

Uses IsAlive Check for availability
• If witness is not available, cluster core resources are failed and moved to another node
• If another node does not bring witness resource online, the resource will remain in a Failed state, with restart attempts every 60 minutes


Witness Server Placement
Basic guidance for placement of witness server in Exchange 2010

“We recommend that you use a Hub Transport server running on Microsoft Exchange Server 2010 in the Active Directory site containing the DAG. This allows the witness server and directory to remain under the control of an Exchange administrator.”

“If your DAG is extended to multiple datacenters, we recommend deploying the witness server in the datacenter that is considered to be the primary datacenter.”


Witness Server Placement
Exchange 2013 guidance more complicated due to new options introduced by architectural changes

Exchange 2013 includes support for configuration options that were not recommended or possible in previous versions of Exchange
• A third location, such as a third physical datacenter or branch office


Witness Server Placement
Ultimately, the placement of a DAG’s witness server depends on business requirements and the options available to the organization

Deployment scenario: Single DAG deployed in a single datacenter
Recommendation: Locate witness server in the same datacenter as DAG members

Deployment scenario: Single DAG deployed across two datacenters; no additional locations available
Recommendation: Locate witness server in primary datacenter

Deployment scenario: Multiple DAGs deployed in a single datacenter
Recommendation: Locate witness server in the same datacenter as DAG members. Additional options include:
• Using the same witness server for multiple DAGs
• Using a DAG member to act as a witness server for a different DAG

Deployment scenario: Multiple DAGs deployed across two datacenters
Recommendation: Locate witness server in the same datacenter as DAG members. Additional options include:
• Using the same witness server for multiple DAGs
• Using a DAG member to act as a witness server for a different DAG

Deployment scenario: Single or multiple DAGs deployed across more than two datacenters
Recommendation: Locate the witness server in the datacenter where you want the majority of quorum votes to exist


Witness Server Placement
If the organization has a third location, then the DAG’s witness server can be deployed there for automatic site resilience
• The witness server location must have network infrastructure and connectivity that is isolated from network failures that affect the two datacenters with Exchange

For all DAGs, the availability of the witness server should be on the Exchange administrator’s radar


Witness Server Placement Azure is not supported for use as a Witness Server for Exchange DAGs

Azure does not support the required underlying network configuration to enable an Azure file server VM to act as a witness server

More info at http://aka.ms/DAGAzure

Dynamic Quorum


Dynamic Quorum
Windows Server 2012 Cluster (and later) feature

Cluster quorum majority is determined by the nodes that are active members of the cluster at a given time
• This is an important distinction from the cluster quorum in previous versions of Windows Server, where the quorum majority is fixed and based on membership

Enabled for all clusters by default


Dynamic Quorum
Cluster dynamically manages the vote assignment to nodes, based on the state of each node
• When a node shuts down or crashes, the node loses its quorum vote
• When a node successfully rejoins the cluster, it regains its quorum vote

By dynamically adjusting the assignment of quorum votes, the cluster can increase or decrease the number of quorum votes that are required to keep running

This enables the cluster to maintain availability during sequential node failures or shutdowns


Dynamic Quorum
With dynamic quorum management, it is also possible for a cluster to run on the last surviving cluster node
• By dynamically adjusting the quorum majority requirement, the cluster can sustain sequential node shutdowns to a single node
• This is referred to as the “Last Man Standing” scenario


Dynamic Quorum Dynamic quorum management does not allow the cluster to sustain a simultaneous failure of a majority of voting members

To continue running, the cluster must always have a quorum majority at the time of a node shutdown or failure

If you explicitly remove the vote of a node, the cluster cannot dynamically add or remove that vote
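
The contrast between sequential and simultaneous failures can be illustrated with a small simulation. The Python sketch below is a deliberately simplified model with hypothetical function names. It assumes one vote per node, no witness, and no tie-breaking logic, so it does not model the final step from two nodes down to one. It only shows why recalculating the vote total after each survivable event changes the outcome.

```python
def has_majority(present_votes: int, total_votes: int) -> bool:
    """True when the surviving votes are a strict majority of the current total."""
    return 2 * present_votes > total_votes


def run_sequence(total_nodes: int, failure_batches: list, dynamic: bool) -> int:
    """Apply batches of simultaneous failures; return nodes left, or 0 if quorum is lost."""
    votes = total_nodes  # current vote total
    up = total_nodes     # nodes still running
    for batch in failure_batches:
        up -= batch
        if up <= 0 or not has_majority(up, votes):
            return 0
        if dynamic:
            votes = up   # dynamic quorum: failed nodes lose their votes
    return up


# Five sequential single-node failures in a seven-node cluster.
print(run_sequence(7, [1, 1, 1, 1, 1], dynamic=True))   # 2 nodes still running
print(run_sequence(7, [1, 1, 1, 1, 1], dynamic=False))  # 0: quorum lost at 3 of 7

# Four nodes failing at the same time is a majority loss either way.
print(run_sequence(7, [4], dynamic=True))                # 0
```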


Dynamic Quorum
[Figure: dynamic quorum example. DQ = 7 with no failed nodes; DQ = 4 with three failed nodes; DQ = 3 with four failed nodes; DQ = 2 with five failed nodes.]

Dynamic Quorum
Use Get-ClusterNode to verify the DynamicWeight common property of a node
• 0 = does not have quorum vote
• 1 = has quorum vote

Get-ClusterNode <Name> | ft name, *weight, state

Name DynamicWeight NodeWeight State
---- ------------- ---------- -----
EX1  1             1          Up

Vote assignment for all cluster nodes can be verified by using the Validate Cluster Quorum test


Dynamic Quorum and DAGs
Dynamic quorum does not change quorum requirements for DAGs

Dynamic quorum does work with DAGs
• All internal DAG testing is performed with dynamic quorum enabled

Dynamic quorum is enabled in Office 365 for DAG members on Windows Server 2012

Exchange is not dynamic quorum-aware


Dynamic Quorum and DAGs Cluster team guidance on dynamic quorum:

“Selecting this option generally increases the availability of the cluster. By default the option is enabled, and it is strongly recommended to not disable this option. This option allows the cluster to continue running in failure scenarios that are not possible when this option is disabled.”

Exchange team guidance on dynamic quorum:
• Leave it enabled for majority of DAG members
• Don’t factor it into availability plans

The advantage is that, in some cases where 2008 R2 would have lost quorum, 2012 can maintain quorum; this only applies to a few cases, and should not be relied upon when planning a DAG

DAG Member Maintenance


DAG Member Maintenance
Basic guidance for DAG member maintenance in Exchange 2010
1. Run StartDagServerMaintenance.ps1 to put DAG member in maintenance mode
2. Perform the maintenance (e.g., install the update rollup)
3. Run StopDagServerMaintenance.ps1 to take DAG member out of maintenance mode and put it back into production
4. Optionally rebalance the DAG by using RedistributeActiveDatabases.ps1


DAG Member Maintenance
Exchange 2013 guidance more complicated

Go into Maintenance Mode
Set-ServerComponentState <Server> -Component HubTransport -State Draining -Requester Maintenance
Set-ServerComponentState <Server> -Component UMCallRouter -State Draining -Requester Maintenance
Redirect-Message -Server <Server> -Target <FQDNTarget>
Suspend-ClusterNode <Server>
Set-MailboxServer <Server> -DatabaseCopyActivationDisabledAndMoveNow $True
Set-MailboxServer <Server> -DatabaseCopyAutoActivationPolicy Blocked
Set-ServerComponentState <Server> -Component ServerWideOffline -State Inactive -Requester Maintenance

Verify Maintenance Mode
Get-ServerComponentState <Server> | ft Component,State -Autosize
Get-MailboxServer <Server> | ft DatabaseCopy* -Autosize
Get-ClusterNode <Server> | fl
Get-Queue


DAG Member Maintenance
Exchange 2013 guidance more complicated

Go into Production Mode
Set-ServerComponentState <Server> -Component ServerWideOffline -State Active -Requester Maintenance
Set-ServerComponentState <Server> -Component UMCallRouter -State Active -Requester Maintenance
Resume-ClusterNode <Server>
Set-MailboxServer <Server> -DatabaseCopyActivationDisabledAndMoveNow $False
Set-MailboxServer <Server> -DatabaseCopyAutoActivationPolicy Unrestricted
Set-ServerComponentState <Server> -Component HubTransport -State Active -Requester Maintenance

Verify Production Mode
Get-ServerComponentState <Server> | ft Component,State -Autosize
Get-MailboxServer <Server> | ft DatabaseCopy* -Autosize
Get-ClusterNode <Server> | fl
Get-Queue

Managed Availability


Exchange 2013 Managed Availability

Cloud Trained

Bringing the learnings from the service to the enterprise

User Focused

Monitoring based on the end user’s experience

Recovery Oriented

Protect the user’s experience through recovery oriented computing


Cloud Trained
5+ Years of Directly Operating the Service
• Since 2007, the Exchange Engineering Team has been operating a cloud version of Exchange

Learnings Are Put Back Into the Product
• Engineers are on-call for service related issues
• Drives the right accountability for awareness (noise/gap ratio) and motivates the team toward auto-healing and recovery

Scale, Auto-Deployment, Optics, High Availability are key tenets
• Decentralized complex processing
• Rollouts don’t require extra configuration

User Focused

If you can’t measure it, you cannot manage it

Availability: Can I access the service?

Latency: How is my experience?

Errors: Am I able to accomplish what I want?

[Figure: Customer Touch Points, measured by Availability, Errors, and Latency]


Recovery Oriented

Sequence 1: OWA send -> OWA failure -> OWA fast recovery -> OWA verified as healthy

Sequence 2: OWA send -> OWA failure -> OWA fast recovery -> Failover server’s databases -> OWA verified as healthy -> Server becomes “good” failover target (again)

[Figure: load balancer (LB) in front of CAS-1 and CAS-2, and a DAG with MBX-1, MBX-2, and MBX-3, each running OWA and hosting copies of DB1 and DB2]

“stuff breaks and the Experience does not”

Monitoring Layers

[Figure: monitoring layers numbered 1 to 4 across CAS and MBX, with labels PROTOCOL, STORE, and PROTOCOL PROXY, on a PROACTIVE to REACTIVE scale marked 20s, 5min, 20min]

System Level Checks
1. Mailbox Self Test (e.g. OWA MST) [detection 5 min]
2. Protocol Self Test (e.g. OWA PST) [detection 20 secs]
3. Proxy Self Test (e.g. OWA PrST) [detection 20 secs]

End User Experience Level Checks
4. Customer Touch Point, CTP (e.g. OWA CTP) [detection 20 min]

Probes

PROBES
• The key goal is to measure the customer’s perception of the service
• These are typically synthetic end to end customer transactions

CHECKS
• The key goal is to measure actual customer traffic and become aware when they are experiencing issues
• These are typically implemented as performance counters where thresholds can be set to detect spikes in customer failures

NOTIFY
• The key goal is to take action immediately based on a critical event
• These are typically exceptions or conditions that can be detected without a large sample set

Monitors

Monitors query the data collected by the probes and determine if an action needs to occur based on a rule set

Depending on the rule, a monitor can escalate or initiate a responder

Monitors can be Healthy, Degraded, Unhealthy, Repairing, Disabled, or Unavailable

A monitor also defines the time from failure at which a responder is executed

MONITOR: “state of the world”

ESCALATE: “take human driven action”

Responders

A responder is a “plug-in” that executes a response to an alert generated by a monitor

There are several types of responders
• Restart Responder: terminates and restarts service
• Reset AppPool Responder: cycles IIS application pool
• Failover Responder: takes a MBX server out of service
• Bugcheck Responder: initiates a bugcheck of the server
• Offline Responder: takes a protocol on a machine out of service
• Online Responder: places a machine back into service
• Escalate Responder: escalates an issue
• Specialized Component Responders

Built-in sequencing mechanism to control recovery actions

ESCALATE: “take human driven action”

RECOVER: “restore service or prevent failure”


Managed Availability Pipeline
• Sampling: Probe Definition, Probe, Probe Results (Samples), Notification Item
• Detection: Monitor Definition, Monitor, Monitor Results (Alerts)
• Recovery: Responder Definition, Responder, Responder Results (Responses)

[Figure: Sequenced HA Responder Pipeline Example. Monitor states: Healthy, T1, T2, T3. Named times: 00:00:00, 00:00:10, 00:00:30. Responders: Restart Responder, Reset AppPool Responder, Failover Responder, Bugcheck Responder, Offline Responder, Escalate Responder.]
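
The idea of a sequenced responder pipeline, where a monitor moves through named times and each one triggers the next recovery action, can be sketched as a time-ordered schedule. The Python sketch below is illustrative only. The pairing of T1, T2, and T3 with particular responders is a hypothetical assumption, since the slide does not spell out the mapping, and the function names are invented for this example.

```python
from datetime import timedelta

# Hypothetical schedule: (time since failure, monitor state, responder to run).
SEQUENCE = [
    (timedelta(seconds=0), "T1", "Restart Responder"),
    (timedelta(seconds=10), "T2", "Failover Responder"),
    (timedelta(seconds=30), "T3", "Escalate Responder"),
]


def monitor_state(unhealthy_for):
    """Return the monitor state for a given time since failure (None means no failure)."""
    if unhealthy_for is None:
        return "Healthy"
    state = "Healthy"
    for start, name, _responder in SEQUENCE:
        if unhealthy_for >= start:
            state = name
    return state


def responders_due(unhealthy_for):
    """List the responders whose named time has been reached, in order."""
    if unhealthy_for is None:
        return []
    return [responder for start, _name, responder in SEQUENCE if unhealthy_for >= start]


elapsed = timedelta(seconds=12)
print(monitor_state(elapsed), responders_due(elapsed))
```

Keeping the schedule as data makes the ordering explicit: a more disruptive action is only reached if the earlier, cheaper one has not restored health by the next named time.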


Managed Availability
Exchange Server Health Summary

Get-HealthReport -Identity <ServerName>

Get-HealthReport <ServerName> -RollupGroup

Get-HealthReport <ServerName> -RollupGroup -HealthSet <HealthSetName>

(Get-DatabaseAvailabilityGroup dag1).Servers | Get-HealthReport -RollupGroup

Get-ServerHealth -Identity <ServerName> | ft Server,CurrentHealthSetState,Name,HealthSetName,AlertValue,HealthGroupName -auto

If a Health Set is Unhealthy, find out why
Get-ServerHealth -Identity <ServerName> -HealthSet <HealthSetName>


For More Information
Exchange Server Home Page - http://aka.ms/ExHome
Exchange Team Blog - http://aka.ms/EHLO
Exchange 2013 Docs - http://aka.ms/E15docs
Exchange 2013 RelNotes - http://aka.ms/E15RelNotes
Exchange 2013 Hybrid - http://aka.ms/E15Hybrid
Exchange 2013 SDK - http://aka.ms/E15SDK


Questions?


© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
