Real World High Availability and Site Resilience Design Robert Gillies Solution Architect Microsoft Corporation EXL308


Agenda

Define High Availability and Site Resilience
Define SLA, OLA, RTO, RPO
How do we ensure success with HA and SR?
Some real-world design examples

High Availability Deep Dive

This session is not a technology deep dive
For deep technical content, you want Scott Schnoll’s session

EXL401 – Microsoft Exchange Server 2010 High Availability Deep Dive

You might also look at TechEd North America 2011 EXL327

My session from last year – more guidance and other scenarios presented there…

High Availability vs Site Resilience

Remember that when Microsoft talks about these terms…

High Availability is an automatic failover event
For instance, a disk fails and another copy of the database comes online with no administrator action necessary

Site Resilience is recovery from a full or partial failure of a datacenter, and is not automatic

For instance, your load balancer appliance fails, and because of this, users in one datacenter cannot access their email
Someone has to make a decision to “fail the datacenters over”
Administrators have to take action to activate the secondary datacenter

Why High Availability or Site Resilience?

When designing a system of any sort, you have to define your requirements
Many of the main requirements are defined in the SLA, whether formal or informal

How available must the system be?
How do we define availability?
What does “available 99.99% of the time” mean?
What about maintenance periods?
Do we get an availability break if we lose a datacenter?

Bottom line is that if you do not have a requirement for High Availability or Site Resilience, don’t deploy it!

SLAs and OLAs

OLA – Operational (or Operating) Level Agreement
Defines the responsibilities of the various internal support organizations working to support an SLA
For instance, if DNS is down, who owns that? AD? Network?

SLA – Service Level Agreement
Defines the level of service provided to a customer
Could define, for instance, how available the messaging service is, or how quickly messages must be delivered
Is part of a contract, and as such could have financial impacts

Not all companies have OLAs and SLAs defined
But if not, how do you know if you need HA or SR?

9’s of Availability

How many of you can say you provide less than 5 minutes downtime a year?
9’s converted to minutes…

99.999% = 5.26 minutes / year = 25.9 seconds / month
99.99% = 52.56 minutes / year = 4.32 minutes / month
99.95% = 4.38 hours / year = 21.6 minutes / month
99.9% = 8.76 hours / year = 43.2 minutes / month
(Monthly figures assume a 30-day month)
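The conversion from 9’s to downtime is simple arithmetic, and a short sketch makes it easy to try other targets. The function and constant names here are illustrative, and the 365-day year and 30-day month are assumptions rather than anything stated on the slide.

```python
# Convert an availability percentage into allowed downtime.
# Assumptions (not from the slides): a 365-day year and a 30-day month.

MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200


def downtime_minutes(availability_pct: float, period_minutes: int) -> float:
    """Return the minutes of downtime allowed in a period at a given availability."""
    return period_minutes * (100.0 - availability_pct) / 100.0


for pct in (99.999, 99.99, 99.95, 99.9):
    per_year = downtime_minutes(pct, MINUTES_PER_YEAR)
    per_month = downtime_minutes(pct, MINUTES_PER_MONTH)
    print(f"{pct}% -> {per_year:.2f} min/year, {per_month:.2f} min/month")
```

Changing the month length (for example to 365/12 days) shifts the monthly numbers slightly, which is one more reason to pin down the measurement period in the SLA.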

Discussion about “how many 9’s” is very difficult
Does maintenance time count against availability? Patch Tuesday happens every month!
Does a partial outage count against the entire service? What if a single database is offline for an hour?
How do you measure availability? What tools do you use?

RTOs and RPOs

RTO = Recovery Time Objective
How long will it take to restore service?

RPO = Recovery Point Objective
How much data could be lost by this outage?

For an Exchange project, we can fairly easily define RTOs and RPOs for many different failures

Loss of a single disk
Loss of any two disks (could be two copies of a single database)
Loss of a server
Loss of a datacenter

Typical RTO / RPO Numbers

Loss of a single disk
RTO = expect 30 seconds, commit to “less than 5 minutes”
RPO = less than 30 seconds (could lose a contact object or an appointment because they are not transported)

Loss of a single server
Same as a single disk – WHY?

Loss of the replication LAN connection
No outage – WHY?

Loss of primary datacenter
RTO = expect an hour, commit to “less than 4 hours” (can you do better?)
RPO = very much depends on your WAN bandwidth, latency and how utilized (or over-utilized) the replication network is
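One way to keep the gap between what you expect and what you commit to visible is to record both per failure type. The sketch below is only an illustration: the class and function names are invented here, the RTO values are the ones from this slide, and RPO is left out because the slide gives no number for the datacenter case.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    failure: str
    expected_rto: timedelta   # what we expect to see in practice
    committed_rto: timedelta  # what we promise in the SLA


# RTO values taken from the slide; the structure itself is hypothetical.
OBJECTIVES = [
    RecoveryObjective("single disk", timedelta(seconds=30), timedelta(minutes=5)),
    RecoveryObjective("single server", timedelta(seconds=30), timedelta(minutes=5)),
    RecoveryObjective("primary datacenter", timedelta(hours=1), timedelta(hours=4)),
]


def met_commitment(objective: RecoveryObjective, actual: timedelta) -> bool:
    """True if the measured recovery time is strictly under the committed RTO."""
    return actual < objective.committed_rto


for obj in OBJECTIVES:
    headroom = obj.committed_rto - obj.expected_rto
    print(f"{obj.failure}: headroom between expectation and commitment = {headroom}")
```

The headroom is the buffer that absorbs slow detection, slow decisions and anything else that goes wrong during a real outage.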

RTO for Site Resilience

Can you commit to having the secondary datacenter online within 15 or 30 minutes of loss of the primary datacenter?

What if you work at the primary and have a total power loss?
What if there is some natural (or malicious) disaster, and you have also lost your admin staff? Or, maybe they are more worried about their own families and have gone home.

What about a scenario where you have only a partial failure of a datacenter?

Loss of load balancer? Who makes the decision to fail over?
What about loss of the WAN? How long will it be down? Would it make sense to fail the datacenter over for a 15 minute WAN outage?

Does your RTO clock start ticking as soon as the failure occurs?
Who makes the decision to fail over?
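To see why the starting point of the RTO clock matters, it can help to add up the phases of a switchover. This is a rough sketch: the phase names and durations are made up for illustration, and only the “less than 4 hours” commitment comes from the earlier slide.

```python
from datetime import timedelta

# Hypothetical phase durations for a datacenter switchover (illustration only).
phases = {
    "detect the failure": timedelta(minutes=10),
    "decide to fail over": timedelta(minutes=90),
    "activate the secondary datacenter": timedelta(minutes=60),
}

committed_rto = timedelta(hours=4)  # "less than 4 hours" from the earlier slide


def elapsed(phases_to_count):
    """Sum the durations of the phases that count against the RTO."""
    return sum((phases[name] for name in phases_to_count), timedelta())


# Clock starts at the failure: every phase counts.
from_failure = elapsed(phases)
# Clock starts at the decision: only the activation work counts.
from_decision = elapsed(["activate the secondary datacenter"])

print("Measured from failure: ", from_failure, "remaining:", committed_rto - from_failure)
print("Measured from decision:", from_decision, "remaining:", committed_rto - from_decision)
```

With these invented numbers the decision phase is the largest single item, which is why knowing in advance who makes the call is part of the design.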

Technology Alone is NOT the Answer

High Availability and Site Resilience both require People, Process and Technology to all do their part

Technology is the EASY part!

What happens if you don’t properly train your people?
What happens if you don’t have enough people to provide 24x7 administration?

Do you require 24x7 service? Or is it OK to be down for a few hours until the Administrator can get in and get the service back up?

What happens if you define a great process, but you have people working for you that don’t follow process?
What if your process is so strictly defined that your people can’t apply an emergency patch and get your service back?

Lossless Availability in Exchange 2010

How many of you think you can guarantee a lossless failover of Exchange 2010?
You cannot guarantee that
Exchange was not designed to guarantee lossless failovers
Exchange WAS designed to help you minimize the possible loss scenarios
Many things we do to design HA and SR revolve around minimizing the amount of possible data loss

Faster replication WAN
Multiple replication LANs

If avoiding data loss is important to you, you should never do anything that would introduce a possibility of data loss

Success With Exchange 2010 HA and SR

When we talk about being successful with Exchange 2010, we typically want to reduce complexity
A simple system reduces risk, and reduction of risk increases the chance of success
We have probably all heard of the KISS principle

And no, this has nothing to do with Gene Simmons…
KISS is an acronym for Keep It Simple, Stupid!
Or, as Albert Einstein said, “everything should be made as simple as possible, but not simpler”

Certain complexities sometimes are required, but if there is no driving requirement, keep it simple!

Example…

What Do We Mean By “Complex”?

Other Design Elements Not Shown…

Deployed on a SAN
3 copies of the data
Virtualized deployment
Blade servers
500 MB mailboxes
Using a third party archival solution with “stubbing”

Some of the Unnecessary Complexities…

Primary Exchange AD site is actually two physical locations

Requires that the latency is less than 10 ms between locations
http://technet.microsoft.com/en-us/library/cc770917(WS.10).aspx

This causes all of the CAS and HT from the two locations to be treated as if they were in a single AD site

MBX will choose HT from either physical location
HT from other AD site will choose HT in either physical location
CAS in either AD site used, no matter where the database really is

Two layers of Client Access Servers?
One for “External OWA”
One for “Internal” traffic

Customers think that this allows for quick “shutoff” for Internet traffic

What about firewalls or reverse proxies or IPSec policies?
This causes us to have more servers than we really need
This causes us to have Client Access role servers in the same AD site with configurations that are different

Separate Public Folder servers?
Separate Hub Transport servers?
Leads to more servers than we need
Leads to multiple server builds

Now we are up to 5 builds!
CAS External
CAS Internal
HT
MBX
Public Folder

If we had “multi-role”, we would have MORE HT servers!

Complexity = Risk

Every project has risks – risk in and of itself is not “bad”
A “risk” is something that MIGHT happen

One risk might be “Getting power pulled into the datacenter on time might not happen because of the timing of the other project”

An “issue” is a risk that HAS happened
The issue from our risk above might be “Power didn’t get pulled on time” and this issue is going to cause our project timeline to slip

Every complexity brings more risks to a project – for example

Multiple server builds mean that you have five separate test passes when patches come out, and you therefore have five times the risk that something will get missed in the test passes
Blade servers could mean that you have a risk of a single backplane failure taking down multiple servers in a single chassis

If Risk Isn’t Bad, Why Worry About It?

While risk itself isn’t bad, the more risk you have, the more chance you have that issues will happen
The more issues that happen, the more chance that your project will fail
Too many issues during the migration and you could miss deadlines, or lose data, or something similar
Too many issues during production and you could miss your SLAs – if you are a hoster or provider, this costs money!
Reduction of risk means a higher chance of success

Example…

What Do We Mean By “Simple”?

The Simple Exchange Architecture

Four copies of the data
DAG: Beyond the “A” (written by Boris Lokhvitsky)
Discusses the mathematics of why four copies is better than two, or why three HT servers in a site would be better than two, etc.

Multi-role servers
Increases the number of HT servers and CAS from normal deployments (yes, decreases CAS in this instance – right-sizing!)
One server build, with the only possible difference being the public folder server (don’t necessarily need every server to host PFs)

Direct Attached Storage (DAS) using JBOD
Big, cheap drives – 7200 RPM 2TB or 3TB drives
Large mailboxes (say 25GB), no external archival, no stubbing

Two datacenters mean a simpler quorum model!

http://blogs.technet.com/b/exchange/archive/2011/09/16/dag-beyond-the-a.aspx
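The intuition behind “four copies is better than two” can be illustrated with a deliberately simplified model. This is not necessarily the analysis used in the article linked above: it assumes each copy fails independently with the same probability, and the 1% figure is a made-up example value.

```python
# Simplified model: each database copy is unavailable independently with
# probability p, so all n copies are unavailable at once with probability p**n.
# The value of p below is a made-up illustration, not a measured failure rate.

def all_copies_down(p: float, n: int) -> float:
    """Probability that every one of n independent copies is unavailable."""
    return p ** n


p = 0.01  # hypothetical: each copy is unavailable 1% of the time
for n in (1, 2, 3, 4):
    print(f"{n} copies: P(all down) = {all_copies_down(p, n):.0e}")
```

Real failures are not fully independent (a shared site, network or operator error can take out several copies at once), so treat the result as an upper bound on the benefit rather than a prediction.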

Example…

A Customer Journaling Solution

Customer Requires Journaling for 7 years

Three options

Utilize native Exchange tools

Turn on Single Item Recovery
Turn on Litigation Hold for 7 years
Catches every mailbox item including contacts

Utilize a third party journaling system with Journaling Mailboxes

Configure journaling mailboxes
Set up journaling by database or user (licensing implications)

Utilize a third party journaling system with Journaling Recipients

Same basic configuration as with Journaling Mailboxes
EXCEPT – instead of journaling to a mailbox in Exchange, we journal to a mail enabled contact – the journaling system is an SMTP destination

But Robert, this is an HA and SR session…

What does journaling have to do with HA and SR?
What if your journal mailbox goes offline?
What if your journaling system goes offline in the primary datacenter?
If you have compliance requirements, are they relaxed in the case of a disaster scenario?

Remember that engineering your messaging solution is not just about Exchange!

How do you do site resilience for the Journaling system?
How about your journaling mailbox?

The easiest solution would be a hosted journaling system
See EXL301 – Archiving in the Cloud with Exchange Online Archiving
Journal to a recipient, and let your transport HA ensure that the messages get delivered to the cloud!

Real World Journaling HA/SR

Customer decided that they needed a separate system
Could have just had larger mailboxes and kept every message for 7 years

Exchange like our simple example above
Two DAGs, four servers per DAG, four copies per database

We added a separate set of disks that were mirrored
On two DAG nodes, this was used for public folders
On other two DAG nodes, used for journaling mailboxes

2 copies of journaling mailboxes, one in each datacenter
Needed slightly higher availability, so decided to RAID
No need for more than 2 copies since we have RAID

We also told the journaling vendor that they needed to replicate their data

External Journaling Does Add Complexity

This does add complexity
A separate system (operations, other storage, etc.)
Interaction with Exchange 2010 (database design for journaling mailboxes, etc.)
HA and SR complexities

These complexities add both risk and cost
BUT, if they are required, then you need to do this

We didn’t add other complexities that we didn’t need
This was the simple solution with the addition of public folders and journaling

Example…

Virtualization

Exchange HA & SR in a Virtualized Environment

In many instances, customers think that a combination of Exchange HA (the DAG) and virtualized HA is better
We need to be clear – the recommended architecture is Exchange HA and SR – use the DAG!
It is supported to do Live Migration and hypervisor clustering, but the recommendation is to not use those for High Availability or Site Resilience

One Other Scenario to Think About…

The Exchange product group (PG) did work in code for the case where MBX and HT are collocated – MBX will use the collocated HT only as a last resort
Scenario (if the code fix was not in there):

User1 sends email to User2
Message submitted from MBX to collocated HT
After the message is submitted, but before it is moved on to the next HT or submitted back into another mailbox, and before the transaction is replicated via Continuous Replication, the host fails, losing all storage
In this case, the message is in User1’s Sent Items folder (if that user is in cached mode), but will not be delivered, and there will be no NDR
The window in which this could happen is very small, and even with the code fix, we are just reducing that window

Back to Virtualization and HA

Exchange is not virtualization aware
If you have a MBX in one guest, and an HT in another guest, and they are collocated on the same host, you have effectively “worked around” the code that the Exchange PG did
Our recommendation would be to NOT have this kind of collocation

Multi-Role Exchange is fine – collocation within the same guest
Two guests causing the collocation is not great

Is this really an HA question? Do you consider loss of data an HA scenario?

Multiple CAS or Multiple HT on a Single Host

Consider the following scenario
4 Exchange servers in an AD site, each running CAS/HT
Every server is virtualized
You are in a state where you have three of these guests on one host, and one on another
The host with three Exchange guests fails, and you are left with one CAS/HT server for that AD site

In this case, the remaining CAS/HT might become overloaded and quit taking new connections, causing an outage
We are designing for HA, and putting multiple Exchange guests on one host could open you up for outage scenarios
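The scenario above amounts to a worst-case check on guest placement: how many CAS/HT servers survive if the busiest host fails? A minimal sketch, with hypothetical guest and host names:

```python
from collections import Counter

# Guest-to-host placement from the scenario above; all names are hypothetical.
placement = {
    "EX1": "HostA",
    "EX2": "HostA",
    "EX3": "HostA",
    "EX4": "HostB",
}


def survivors_after_worst_host_failure(placement: dict) -> int:
    """Return how many guests remain if the most heavily loaded host fails."""
    guests_per_host = Counter(placement.values())
    return len(placement) - max(guests_per_host.values())


print(survivors_after_worst_host_failure(placement))

# Spreading the guests across four hosts limits any one failure to one guest.
spread = {"EX1": "HostA", "EX2": "HostB", "EX3": "HostC", "EX4": "HostD"}
print(survivors_after_worst_host_failure(spread))
```

A check like this could be run against the hypervisor inventory after any migration, since the placement can drift over time into the three-on-one-host state described here.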

Example…

Quorum

Quorum in a “Lopsided” DAG…

Scenario
12 node DAG

4 nodes in SiteA – VIP Location (Let’s say “Headquarters”)
8 nodes in SiteB – “General Population” Location (Let’s say “Production”)

Network between sometimes gets a bit flaky
What happens?

Sometimes you lose connectivity between locations
When you lose connectivity, quorum is maintained in SiteB
VIPs then can’t access their email (no connectivity, databases move to SiteB)
This is not what you wanted!
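The outcome follows from simple vote counting. The sketch below uses a simplified majority model (one vote per node plus one for the file share witness) and assumes the witness is reachable from SiteA, the most favorable case for the VIP site; it does not model any other cluster behavior.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_reachable > total_votes // 2


# 12-node DAG: 4 voters in SiteA, 8 in SiteB, plus a file share witness (FSW).
# Assumption: the FSW sits in SiteA, the most favorable case for SiteA.
site_a_nodes, site_b_nodes, witness = 4, 8, 1
total = site_a_nodes + site_b_nodes + witness  # 13 votes, so 7 are needed

print("SiteA + FSW:", has_quorum(site_a_nodes + witness, total))
print("SiteB alone:", has_quorum(site_b_nodes, total))
```

Even with the witness on its side, SiteA holds only 5 of 13 votes, so SiteB keeps the cluster every time the link drops.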

What Are Your Options?

You could just live with it… NOT a great option since you aren’t really meeting your requirements!
First (realistic) option is to add DAG members that can vote, but do not host databases

Extra MBX role servers with no databases
Could be virtualized

Second (realistic) option is to implement KB 2494036

Introduces a new attribute on the cluster called NodeWeight (set using cluster.exe or the Get-ClusterNode cmdlet)
Nodes with NodeWeight of “0” (zero) do not get a quorum vote
Set the FSW to be in the VIP datacenter

http://support.microsoft.com/kb/2494036
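Both realistic options work by shifting the vote count toward SiteA. The following sketch uses the same simplified majority model (one vote per voting node plus the witness). The specific sizing, four extra voters in option 1 and four zero-weighted nodes in option 2, is an assumption chosen for illustration and is not taken from the session.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_reachable > total_votes // 2


FSW = 1  # file share witness vote, placed in SiteA (the VIP datacenter)

# Option 1 (hypothetical sizing): add 4 database-free voting members to SiteA.
a, b = 4 + 4, 8
print("Option 1, SiteA + FSW keeps quorum:", has_quorum(a + FSW, a + b + FSW))
print("Option 1, SiteB keeps quorum:", has_quorum(b, a + b + FSW))

# Option 2 (hypothetical sizing): set NodeWeight to 0 on 4 of the 8 SiteB nodes.
a, b = 4, 8 - 4
print("Option 2, SiteA + FSW keeps quorum:", has_quorum(a + FSW, a + b + FSW))
print("Option 2, SiteB keeps quorum:", has_quorum(b, a + b + FSW))
```

In both cases the witness becomes the tie-breaker, which is why its placement in the VIP datacenter matters. Option 2 reaches the same result without deploying any additional servers.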

To Wrap It All Up…

Everything you do should be based on your requirements
If you don’t have a documented requirement for a feature, why are you implementing it?

Risks are not bad, but you should avoid introducing risks that you do not need
Higher costs are bad, and you should avoid escalating costs where there are less expensive, simpler solutions that will meet your requirements
Keep your deployment as simple as you possibly can – push back on your complexities

If a requirement forces you to add a complexity, then OK
If there is no requirement driving it, then why would you add the risk?

And the last point…

Every option we have shown today is supported
Even the most complex system we discussed – nothing in there was unsupported

The question isn’t whether you CAN do something, it is whether you SHOULD do it

Related Content

EXL304 – An Inside View of Microsoft Exchange 2010 SP2

EXL305 – Microsoft Exchange Server 2010 SP2 Tips & Tricks

EXL301 – Archiving in the Cloud With Exchange Online Archiving (EOA)

EXL306 – Best Practices for Virtualizing Microsoft Exchange Server 2010

EXL401 – Microsoft Exchange Server 2010 High Availability Deep Dive

Geek Out with Perry Blog: http://blogs.technet.com/b/perryclarke/

Track Resources

Exchange Team Blog: http://blogs.technet.com/b/exchange/

Exchange TechNet Tech Center: http://technet.microsoft.com/exchange

MEC Website and Registration: http://www.mecisback.com/


Resources

Connect. Share. Discuss.

http://northamerica.msteched.com

Learning

Microsoft Certification & Training Resources

www.microsoft.com/learning

TechNet

Resources for IT Professionals

http://microsoft.com/technet

Resources for Developers

http://microsoft.com/msdn


© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
