Real World High Availability and Site Resilience Design
Robert Gillies
Solution Architect
Microsoft Corporation
EXL308
Agenda
Define High Availability and Site Resilience
Define SLA, OLA, RTO, RPO
How do we ensure success with HA and SR?
Some real world design examples
High Availability Deep Dive
This session is not a technology deep dive
For deep technical content, you want Scott Schnoll’s session
EXL401 – Microsoft Exchange Server 2010 High Availability Deep Dive
You might also look at TechEd North America 2011 EXL327
My session from last year – more guidance and other scenarios presented there…
High Availability vs Site Resilience
Remember that when Microsoft talks about these…
High Availability is an automatic failover event
For instance, a disk fails and another copy of the database comes online with no administrator action necessary
Site Resilience is recovery from a full or partial failure of a datacenter, and is not automatic
For instance, your load balancer appliance fails, and because of this, users in one datacenter cannot access their email
Someone has to make a decision to “fail the datacenters over”
Administrators have to take action to cause the activation of the secondary datacenter
Why High Availability or Site Resilience?
When designing a system of any sort, you have to define your requirements
Many of the main requirements are defined in the SLA, whether formal or informal
How available must the system be?
How do we define availability?
What does “available 99.99% of the time” mean?
What about maintenance periods?
Do we get an availability break if we lose a datacenter?
Bottom line is that if you do not have a requirement for High Availability or Site Resilience, don’t deploy it!
SLAs and OLAs
OLA – Operational (or Operating) Level Agreement
Defines the responsibilities of the various internal support organizations working to support an SLA
For instance, if DNS is down, who owns that? AD? Network?
SLA – Service Level Agreement
Defines the level of service provided to a customer
Could define, for instance, how available the messaging service is, or how quickly messages must be delivered
Is part of a contract, and as such could have financial impacts
Not all companies have OLAs and SLAs defined
But if not, how do you know if you need HA or SR?
9’s of Availability
How many of you can say you provide less than 5 minutes downtime a year?
9’s converted to minutes…
99.999% = 5.26 minutes / year = 25.9 seconds / month
99.99% = 52.56 minutes / year = 4.32 minutes / month
99.95% = 4.38 hours / year = 21.6 minutes / month
99.9% = 8.76 hours / year = 43.2 minutes / month
Discussion about “how many 9’s” is very difficult
Does maintenance time count against availability? Patch Tuesday happens every month!
Does a partial outage count against the entire service? What if a single database is offline for an hour?
How do you measure availability? What tools do you use?
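The conversion from 9’s to minutes is simple arithmetic, and it helps to have it on hand when negotiating an SLA. The sketch below assumes a 365-day year and a 30-day month, which is what the per-year and per-month figures above appear to use; the function name is illustrative only.

```python
# Convert an availability percentage into the downtime it permits.
# Assumption: 365-day year and 30-day month.

MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200


def allowed_downtime_minutes(availability_pct: float, period_minutes: int) -> float:
    """Minutes of downtime permitted in a period at the given availability."""
    return (1 - availability_pct / 100) * period_minutes


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.95, 99.9):
        per_year = allowed_downtime_minutes(pct, MINUTES_PER_YEAR)
        per_month = allowed_downtime_minutes(pct, MINUTES_PER_MONTH)
        print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

This also shows why the maintenance question matters: if a 60-minute monthly patch window counted against availability, it would by itself exceed the 43.2-minute monthly budget at 99.9%.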
RTOs and RPOs
RTO = Recovery Time Objective
How long will it take to restore service?
RPO = Recovery Point Objective
How much data could be lost by this outage?
For an Exchange project, we can fairly easily define RTOs and RPOs for many different failures
Loss of a single disk
Loss of any two disks (could be two copies of a single database)
Loss of a server
Loss of a datacenter
Typical RTO / RPO Numbers
Loss of a single disk
RTO = expect 30 seconds, commit to “less than 5 minutes”
RPO = less than 30 seconds (could lose a contact object or an appointment because they are not transported)
Loss of a single server
Same as a single disk – WHY?
Loss of the replication LAN connection
No outage – WHY?
Loss of primary datacenter
RTO = expect an hour, commit to “less than 4 hours” (can you do better?)
RPO = very much depends on your WAN bandwidth, latency and how utilized (or over-utilized) the replication network is
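To make the datacenter RPO point concrete, here is a rough back-of-the-envelope model, not a measurement and not how Exchange itself reports replication health. It assumes logs are generated at a steady rate during a busy period and that the WAN ships them at a steady rate; latency is ignored. All names and numbers are hypothetical.

```python
def replication_backlog_mb(log_rate_mb_s: float,
                           wan_rate_mb_s: float,
                           burst_seconds: float) -> float:
    """Log data (MB) generated but not yet shipped at the end of a busy period.

    Simplified assumption: constant rates, no latency, empty queue at the start.
    """
    shortfall = max(0.0, log_rate_mb_s - wan_rate_mb_s)
    return shortfall * burst_seconds


if __name__ == "__main__":
    # Hypothetical: 2.0 MB/s of logs, 1.5 MB/s of usable WAN, 10-minute burst.
    print(replication_backlog_mb(2.0, 1.5, 600))   # 300.0
    # WAN faster than log generation: nothing queues up.
    print(replication_backlog_mb(1.0, 1.5, 600))   # 0.0
```

If the primary datacenter were lost at the end of that burst, the backlog is roughly the data at risk, which is why an over-utilized replication network translates directly into a worse RPO.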
RTO for Site Resilience
Can you commit to having the secondary datacenter online within 15 or 30 minutes of loss of the primary datacenter?
What if you work at the primary and have a total power loss?
What if there is some natural (or malicious) disaster, and you have also lost your admin staff? Or, maybe they are more worried about their own families and have gone home.
What about a scenario where you have only a partial failure of a datacenter?
Loss of load balancer? Who makes the decision to fail over?
What about loss of the WAN? How long will it be down? Would it make sense to fail the datacenter over for a 15 minute WAN outage?
Does your RTO clock start ticking as soon as the failure occurs?
Who makes the decision to fail over???
Technology Alone is NOT the Answer
High Availability and Site Resilience both require People, Process and Technology to all do their part
Technology is the EASY part!
What happens if you don’t properly train your people?
What happens if you don’t have enough people so that you don’t have 24x7 administration?
Do you require 24x7 service? Or is it OK to be down for a few hours until the Administrator can get in and get the service back up?
What happens if you define a great process, but you have people working for you that don’t follow process?
What if your process is so strictly defined that your people can’t apply an emergency patch and get your service back?
Lossless Availability in Exchange 2010
How many of you think you can guarantee a lossless failover of Exchange 2010?
You cannot guarantee that
Exchange was not designed to guarantee lossless failovers
Exchange WAS designed to help you minimize the possible loss scenarios
Many things we do to design HA and SR revolve around minimizing the amount of possible data loss
Faster replication WAN
Multiple replication LANs
If avoiding data loss is important to you, you should never do anything that would introduce a possibility of data loss
Success With Exchange 2010 HA and SR
When we talk about being successful with Exchange 2010, we typically want to reduce complexity
A simple system reduces risk, and reduction of risk increases the chance of success
We have probably all heard of the KISS principle
And no, this has nothing to do with Gene Simmons…
KISS is an acronym for Keep It Simple, Stupid!
Or, as Albert Einstein said, “everything should be made as simple as possible, but not simpler”
Certain complexities sometimes are required, but if there is no driving requirement, keep it simple!
Example…
What Do We Mean By “Complex”?
Other Design Elements Not Shown…
Deployed on a SAN
3 copies of the data
Virtualized deployment
Blade servers
500 MB mailboxes
Using a third party archival solution with “stubbing”
Some of the Unnecessary Complexities…
Primary Exchange AD site is actually two physical locations
Requires that the latency is less than 10 ms between locations
http://technet.microsoft.com/en-us/library/cc770917(WS.10).aspx
This causes all of the CAS and HT from the two locations to be treated as if they were in a single AD site
MBX will choose HT from either physical location
HT from other AD site will choose HT in either physical location
CAS in either AD site used, no matter where the database really is
Some of the Unnecessary Complexities…
Two layers of Client Access Servers?
One for “External OWA”
One for “Internal” traffic
Customers think that this allows for quick “shutoff” for Internet traffic
What about firewalls or reverse proxies or IPSec policies?
This causes us to have more servers than we really need
This causes us to have Client Access role servers in the same AD site with configurations that are different
Some of the Unnecessary Complexities…
Separate Public Folder servers?
Separate Hub Transport servers?
Leads to more servers than we need
Leads to multiple server builds
Now we are up to 5 builds!
CAS External
CAS Internal
HT
MBX
Public Folder
If we had “multi-role”, we would have MORE HT servers!
Complexity = Risk
Every project has risks – risk in and of itself is not “bad”
A “risk” is something that MIGHT happen
One risk might be “Getting power pulled into the datacenter on time might not happen because of the timing of the other project”
An “issue” is a risk that HAS happened
The issue from our risk above might be “Power didn’t get pulled on time” and this issue is going to cause our project timeline to slip
Every complexity brings more risks to a project – for example
Multiple server builds means that you have five separate test passes when patches come out, and you therefore have five times the risk that something will get missed in the test passes
Blade servers could mean that you have a risk of a single backplane failure taking down multiple servers in a single chassis
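The “five test passes, five times the risk” point can be put into numbers with a small sketch. It assumes each test pass independently has the same chance of missing a problem; that probability is a made-up figure for illustration, not something measured.

```python
def chance_of_any_miss(p_miss_per_pass: float, builds: int) -> float:
    """Probability that at least one of `builds` independent test passes
    misses a problem, given a per-pass miss probability."""
    return 1 - (1 - p_miss_per_pass) ** builds


if __name__ == "__main__":
    p = 0.02  # hypothetical 2% chance of a miss per test pass
    one_build = chance_of_any_miss(p, 1)
    five_builds = chance_of_any_miss(p, 5)
    print(f"1 build:  {one_build:.4f}")
    print(f"5 builds: {five_builds:.4f} ({five_builds / one_build:.1f}x)")
```

For small per-pass probabilities the result is very close to five times the single-build risk, which matches the rule of thumb above.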
If Risk Isn’t Bad, Why Worry About It?
While risk itself isn’t bad, the more risk you have, the more chance you have that issues will happen
The more issues that happen, the more chance that your project will fail
Too many issues during the migration and you could miss deadlines, or lose data, or something similar
Too many issues during production and you could miss your SLAs – if you are a hoster or provider, this costs money!
Reduction of risk means a higher chance of success
Example…
What Do We Mean By “Simple”?
The Simple Exchange Architecture
Four copies of the data
DAG: Beyond the “A” (Written by Boris Lokhvitsky)
Discusses the mathematics of why four copies is better than two, or why three HT servers in a site would be better than two, etc.
Multi-role servers
Increases the number of HT servers and CAS from normal deployments (yes, decreases CAS in this instance – right-sizing!)
One server build, with the only difference possibly the public folder server (don’t necessarily need every server to host PFs)
Direct Attached Storage (DAS) using JBOD
Big, cheap drives – 7200 RPM 2TB or 3TB drives
Large mailboxes (say 25GB), no external archival, no stubbing
Two datacenters means a more simple quorum model!
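As a feel for why four copies beat two, here is a deliberately simplified model: treat each database copy as independently available with the same probability, so the database is down only when every copy is down. This is an illustration under those assumptions, not the derivation from the paper mentioned above, and the 99% per-copy figure is hypothetical.

```python
def database_availability(copy_availability: float, copies: int) -> float:
    """Availability of a database with `copies` independent copies,
    where the database is up as long as any one copy is up."""
    return 1 - (1 - copy_availability) ** copies


if __name__ == "__main__":
    per_copy = 0.99  # hypothetical availability of one copy on one JBOD disk
    for n in (1, 2, 3, 4):
        print(f"{n} copies: {database_availability(per_copy, n):.8f}")
```

Each added copy multiplies the chance of total loss by the per-copy failure probability, which is what makes cheap disks without RAID workable when there are enough copies. Real failures are not fully independent (shared site, shared network), so treat the output as an upper bound.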
Example…
A Customer Journaling Solution
Customer Requires Journaling for 7 years
Three options
Utilize native Exchange tools
Turn on Single Item Recovery
Turn on Litigation Hold for 7 years
Catches every mailbox item including contacts
Utilize a third party journaling system with Journaling Mailboxes
Configure journaling mailboxes
Set up journaling by database or user (licensing implications)
Utilize a third party journaling system with Journaling Recipients
Same basic configuration as with Journaling Mailboxes
EXCEPT – instead of journaling to a mailbox in Exchange, we journal to a mail enabled contact – the journaling system is an SMTP destination
But Robert, this is an HA and SR session…
What does journaling have to do with HA and SR?
What if your journal mailbox goes offline?
What if your journaling system goes offline in the primary datacenter?
If you have compliance requirements, are they relaxed in the case of a disaster scenario?
Remember that engineering your messaging solution is not just about Exchange!
How do you do site resilience for the Journaling system?
How about your journaling mailbox?
The easiest solution would be a hosted journaling system
See EXL301 – Archiving in the Cloud with Exchange Online Archiving
Journal to a recipient, and let your transport HA ensure that the messages get delivered to the cloud!
Real World Journaling HA/SR
Customer decided that they needed a separate system
Could have just had larger mailboxes and kept every message for 7 years
Exchange like our simple example above
Two DAGs, four servers per DAG, four copies per database
We added a separate set of disks that were mirrored
On two DAG nodes, this was used for public folders
On other two DAG nodes, used for journaling mailboxes
2 copies of journaling mailboxes, one in each datacenter
Needed slightly higher availability, so decided to RAID
No need for more than 2 copies since we have RAID
We also told the journaling vendor that they needed to replicate their data
External Journaling Does Add Complexity
This does add complexity
A separate system (operations, other storage, etc.)
Interaction with Exchange 2010 (database design for journaling mailboxes, etc.)
HA and SR complexities
These complexities add both risk and cost
BUT, if they are required, then you need to do this
We didn’t add other complexities that we didn’t need
This was the simple solution with the addition of public folders and journaling
Example…
Virtualization
Exchange HA & SR in a Virtualized Environment
In many instances, customers think that a combination of Exchange HA (the DAG) and virtualized HA is better
We need to be clear – the recommended architecture is Exchange HA and SR – use the DAG!
It is supported to do Live Migration and hypervisor clustering, but the recommendation is to not use those for High Availability or Site Resilience
One Other Scenario to Think About…
Exchange PG did work in code for when you have MBX and HT collocated – MBX will use the collocated HT as last resort
Scenario (if the code fix was not in there):
User1 sends email to User2
Message submitted from MBX to collocated HT
After the message is submitted, but before it is moved on to the next HT or submitted back into another mailbox, and before the transaction is replicated via Continuous Replication, the host fails, losing all storage
In this case, the message is in User1’s Sent Items folder (if that user is in cached mode), but will not be delivered, and there will be no NDR
The window in which this could happen is very small, and even with the code fix, we are only reducing that window
Back to Virtualization and HA
Exchange is not virtualization aware
If you have a MBX in one guest, and an HT in another guest, and they are collocated, you have effectively “worked around” the code that the Exchange PG did
Our recommendation would be to NOT have this kind of collocation
Multi-Role Exchange is fine – collocation within the same guest
Two guests causing the collocation is not great
Is this really an HA question? Do you consider loss of data an HA scenario?
Multiple CAS or Multiple HT on a Single Host
Consider the following scenario
4 Exchange servers in an AD site, each running CAS/HT
Every server is virtualized
You are in a state where you have three of these guests on one host, and one on another
The host with three Exchange guests fails, and you are left with one CAS/HT server for that AD site
In this case, the remaining CAS/HT might become overloaded and quit taking new connections, causing an outage
We are designing for HA, and putting multiple Exchange guests on one host could open you up for outage scenarios
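A quick way to sanity-check a guest placement against the scenario above is to ask how many CAS/HT guests survive if the busiest host fails. The sketch below is a plain calculation over a guest-to-host mapping; the server and host names are hypothetical, and it does not query any hypervisor.

```python
from collections import Counter
from typing import Dict


def survivors_after_worst_host_failure(placement: Dict[str, str]) -> int:
    """Guests still running if the host carrying the most guests fails."""
    if not placement:
        return 0
    guests_per_host = Counter(placement.values())
    return len(placement) - max(guests_per_host.values())


if __name__ == "__main__":
    # The state from the scenario: three guests on one host, one on another.
    lopsided = {"EX1": "HostA", "EX2": "HostA", "EX3": "HostA", "EX4": "HostB"}
    # One guest per host.
    spread = {"EX1": "HostA", "EX2": "HostB", "EX3": "HostC", "EX4": "HostD"}
    print(survivors_after_worst_host_failure(lopsided))  # 1
    print(survivors_after_worst_host_failure(spread))    # 3
```

Whether one or three survivors can carry the load is a sizing question, but the lopsided placement leaves far less headroom for the same number of guests.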
Example…
Quorum
Quorum in a “Lopsided” DAG…
Scenario
12 node DAG
4 nodes in SiteA – VIP Location (Let’s say “Headquarters”)
8 nodes in SiteB – “General Population” Location (Let’s say “Production”)
Network between sometimes gets a bit flaky
What happens?
Sometimes you lose connectivity between locations
When you lose connectivity, quorum is maintained in SiteB
VIPs then can’t access their email (no connectivity, databases move to SiteB)
This is not what you wanted!
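The reason SiteB wins is plain majority arithmetic. The sketch below assumes the 12-node DAG also has a file share witness (FSW), giving 13 votes in total, and that a partition keeps quorum only if it can reach a strict majority of all votes. It is a simplified model of the vote count, not of the cluster service itself.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_reachable > total_votes // 2


if __name__ == "__main__":
    total = 12 + 1  # 12 DAG members plus the witness (assumed)
    # Best case for SiteA: its 4 nodes plus the witness.
    print("SiteA keeps quorum:", has_quorum(4 + 1, total))  # False
    # SiteB: its 8 nodes, even without the witness.
    print("SiteB keeps quorum:", has_quorum(8, total))      # True
```

Seven votes are needed, so SiteA cannot win a WAN split in this layout no matter where the witness sits.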
What Are Your Options?
You could just live with it… NOT a great option since you aren’t really meeting your requirements!
First (realistic) option is to add DAG members that can vote, but do not host databases
Extra MBX role servers with no databases
Could be virtualized
Second (realistic) option is to implement KB 2494036
Introduces a new attribute on the cluster called NodeWeight (set using cluster.exe or the Get-ClusterNode cmdlet)
Nodes with NodeWeight of “0” (zero) do not get a quorum vote
Set the FSW to be in the VIP datacenter
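Both options can be checked with the same vote arithmetic. The sketch below is a simplified model: each site contributes its voting nodes, the witness adds one vote to the site where it lives, and a site survives a WAN split only with a strict majority. The specific counts (four extra voters, four zero-weight nodes) are illustrative choices, not figures from the session.

```python
from typing import Dict, Optional


def site_with_quorum(node_votes: Dict[str, int],
                     witness_site: Optional[str]) -> Optional[str]:
    """Return the site that holds a strict majority of votes after a
    WAN split, or None if no site does."""
    total = sum(node_votes.values()) + (1 if witness_site else 0)
    for site, votes in node_votes.items():
        reachable = votes + (1 if site == witness_site else 0)
        if reachable > total // 2:
            return site
    return None


if __name__ == "__main__":
    # Option 1 (assumed numbers): add 4 voting members with no databases
    # to SiteA, keep the FSW in SiteA. 8 + 1 votes against 8.
    print(site_with_quorum({"SiteA": 8, "SiteB": 8}, "SiteA"))  # SiteA

    # Option 2 (assumed numbers): NodeWeight 0 on 4 of the 8 SiteB nodes,
    # FSW in SiteA. 4 + 1 votes against 4.
    print(site_with_quorum({"SiteA": 4, "SiteB": 4}, "SiteA"))  # SiteA
```

Either way the goal is the same: make sure the VIP site plus the witness outvotes the other site when the link drops.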
To Wrap It All Up…
Everything you do should be based on your requirements
If you don’t have a documented requirement for a feature, why are you implementing it?
Risks are not bad, but you should avoid introducing risks that you do not need
Higher costs are bad, and you should avoid escalating costs where there are less expensive, simpler solutions that will meet your requirements
Keep your deployment as simple as you possibly can – push back on your complexities
If a requirement forces you to add a complexity, then OK
If there is no requirement driving it, then why would you add the risk?
And the last point…
Every option we have shown today is supported
Even the most complex system we discussed – nothing in there was unsupported
The question isn’t whether you CAN do something, it is whether you SHOULD do it
Related Content
EXL304 – An Inside View of Microsoft Exchange 2010 SP2
EXL305 – Microsoft Exchange Server 2010 SP2 Tips & Tricks
EXL301 – Archiving in the Cloud With Exchange Online Archiving (EOA)
EXL306 – Best Practices for Virtualizing Microsoft Exchange Server 2010
EXL401 – Microsoft Exchange Server 2010 High Availability Deep Dive
Geek Out with Perry Blog: http://blogs.technet.com/b/perryclarke/
Track Resources
Exchange Team Blog: http://blogs.technet.com/b/exchange/
Exchange TechNet Tech Center: http://technet.microsoft.com/exchange
MEC Website and Registration: http://www.mecisback.com/
Resources
Connect. Share. Discuss.
http://northamerica.msteched.com
Learning
Microsoft Certification & Training Resources
www.microsoft.com/learning
TechNet
Resources for IT Professionals
http://microsoft.com/technet
Resources for Developers
http://microsoft.com/msdn
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to
be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
PRESENTATION.