
High Availability and Disaster Recovery at ServiceU: A SQL Server 2008 Technical Case Study

SQL Server Technical Article

Writer: David P. Smith (ServiceU Corporation)

Contributors: Ron Talmage (Solid Quality Mentors); Sanjay Mishra, Prem Mehra

Technical Reviewers: James Podgorski, Mike Weiner, Peter Carlin

Published: August 2009

Applies to: SQL Server 2008

Summary: ServiceU Corporation is a leading provider of event management software. An essential part of our IT mission is maintaining high availability and disaster recovery capability. This technical case study shows how we use Windows failover clustering and SQL Server 2008 database mirroring to eliminate single points of failure in our data centers and enable fast recovery from a possible disaster at our primary data center. These strategies and solutions will be of interest to database administrators, senior IT managers, project leads, and architects.

Introduction

ServiceU Corporation, based in Memphis, Tennessee, is a leading provider of online and on-demand event management software. Our software services are used by churches, schools, universities, theaters, and businesses to manage events such as concerts and conferences as well as online payments. We have customers in all 50 states of the United States and in 15 countries worldwide.

Our software services are built and deployed using the Microsoft® Application Platform, including the Microsoft .NET connection software, the Microsoft SQL Server® 2008 database software, and the Windows Server® 2008 operating system. The Microsoft Application Platform helps us provide a seamless user experience and maximum availability of our applications to users. The applications use both the Software as a Service (SaaS) model and the Microsoft Software + Services architecture.

From a security standpoint, we maintain Payment Card Industry (PCI) Level 1 compliance to protect credit card holder and Automated Clearing House (ACH) information. (Details of our PCI Compliance are not covered in this case study.)

Achieving maximum availability and near-immediate recovery from a disaster is essential for maintaining our revenue stream. We have worked hard to eliminate all single points of failure in our architecture, and we have developed procedures for patching servers, upgrading software, and implementing application changes that preserve high availability. Based on these efforts, we have achieved 99.99 percent uptime, including both planned and unplanned downtime.

This case study examines the decisions that we made and the procedures we employed to maintain maximum availability with minimal downtime. This information will be of interest to senior IT managers, project leads, architects, and database administrators (DBAs).

The ServiceU Application Architecture

A logical view of our application architecture is shown in Figure 1.

Figure 1: A logical view of the ServiceU application architecture showing the application tier layers

Note the following about our architecture:

- Our customers can access our application directly through their browsers, their own Web servers, and from their own e-commerce servers.
- All customer activity is processed through our Web farm that holds the middle-tier layer.
- The end-user application is built with Microsoft technologies through a series of layers, all of which eventually go through the Data Access Layer to contact the application databases.
- The data layer consists of SQL Server 2008 databases.

In order to maintain Level 1 PCI Compliance, rigorous security measures are enforced to protect user cardholder data. In addition, we maintain service level agreements (SLAs) with customers that specify required levels of performance and availability of the application.

Availability Goals


Our revenue stream is based on customer activity. Consequently, it is vitally important that our application maintain maximum uptime and availability.

We keep the following general goals in mind for our availability solutions:

- Ensure that all PCI Compliance security measures are applied throughout the network: If a standby data center is used for disaster recovery, it must also be PCI compliant.
- Eliminate all single points of failure: from the Internet presence to the data center, including network, Web and database servers, and data storage.

To help achieve our uptime goals and meet desired service level agreements (SLAs), we created specific guidelines for allowable data loss and service downtime. These objectives were defined by recovery point objectives (RPOs) and recovery time objectives (RTOs) as discussed in the following list:

- Unplanned downtime:
  - Loss of a database server:
    - RPO = 0; that is, no data loss
    - RTO = 60 seconds maximum
  - Loss of the primary data center, or the entire database storage unit in the primary data center:
    - RPO = 3 minutes maximum; lost data may be recovered if the primary data center can be made available
    - RTO = 15 minutes total, including evaluation of the issue; 5 minutes maximum for making the necessary changes to bring the standby data center online
- Planned downtime:
  - RPO = 0 (no data loss)
  - RTO = 60 seconds maximum; some database changes may require a longer downtime than 60 seconds; in those cases every effort is made to minimize the service interruption

High Availability

To implement high availability within the data center, we decided to implement Windows® failover clustering with storage area network (SAN) database storage:

- Windows failover clustering provides database server redundancy within a data center.
- Each failover cluster has fully redundant SAN storage for data protection within a data center.
- We use three nodes in each cluster to preserve high availability during patches and upgrades.

Figure 2 shows our high availability architecture.


Figure 2: ServiceU uses a three-node Windows failover cluster with one clustered SQL Server instance

The next two sections describe the redundant server and storage strategies illustrated in Figure 2.

Database Server Protection: Failover Clustering

For database high availability, we have chosen to use a three-node Windows failover cluster with one SQL Server 2008 failover cluster instance:

- If any single cluster node fails, there are always two remaining nodes available in the cluster, so the SQL Server instance is still protected by a redundant node on the cluster.
- An upgrade or patch can be applied to one node at a time (commonly called a rolling upgrade or update), leaving two nodes always available. This preserves high availability during scheduled maintenance.
- We can replace an existing node with a newer node and have two nodes available for failover during the replacement.

Each node of the cluster has two host bus adapters (HBAs) connecting it to the Fibre Channel switches, each following a redundant path to the data. The cluster heartbeat network uses a private network, not a crossover cable.

Before upgrading to SQL Server 2008, we used three-node Windows Server 2003 failover clusters with a single SQL Server 2005 instance. When we upgraded to SQL Server 2008, we also upgraded to Windows Server 2008. To continue using a three-node failover cluster on Windows Server 2008, we implemented Windows Server 2008 failover clustering using the No Majority: Disk Only mode.

With an odd number of nodes in a Windows Server 2008 failover cluster, Microsoft usually recommends using a Node Majority quorum mode. However, for us, the Node Majority quorum mode represents a relative loss of availability when only one node is operational:

- In a Node Majority quorum mode, each node that is available and in communication can vote to determine whether the cluster can continue running.
- For a three-node cluster, Node Majority mode will keep its quorum if one node is not available (two votes being a majority of three nodes).
- However, if two nodes are not available, the cluster will lose its quorum because the one node remaining has only one vote. This means that the entire cluster will be offline.
- We require that even if only one node of the cluster is available, the cluster must remain online.

As a result, we chose to configure our three-node clusters using the No Majority: Disk Only quorum mode. In this mode, a cluster will still run with only one node available. This is equivalent to the quorum disk model of failover clustering on previous versions of Windows Server.

To protect the quorum disk, we place it on the SAN with its own logical unit number (LUN). The database server connects to a SAN with fully redundant hardware by using multiple redundant paths. On the SAN, the quorum disk's LUN volume is mirrored using RAID. Because we have protected the quorum disk, we ignore the warning Windows Server 2008 gives in the Failover Cluster Management utility, stating that the Quorum Disk Only option may be a single point of failure.

We made the following decisions when building our clusters:

- On each three-node cluster, two nodes are designated as preferred owners and have identical memory and CPU configuration. The third node, used primarily as a backup during patches and upgrades, has less memory. We implement a startup stored procedure to set the SQL Server memory based on detection of the active node (a sketch follows this list).
- All resources are set to fail over if a restart is unsuccessful.
- Failover between cluster nodes is automatic, but the cluster is configured to prevent automatic failback to the preferred node. We fail back to the preferred node manually when convenient.
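The startup procedure itself is not reproduced in this paper. As a rough, hypothetical sketch of the approach (the node name, memory values, and procedure name below are illustrative assumptions, not ServiceU's actual code), a Transact-SQL startup procedure can detect the active cluster node and size memory accordingly:

-- Hypothetical sketch; startup procedures must be created in the master database.
USE master;
GO
CREATE PROCEDURE dbo.usp_SetMemoryForActiveNode
AS
BEGIN
    -- Which physical node currently owns the clustered instance?
    DECLARE @node sysname = CAST(SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS sysname);

    -- Illustrative values: the smaller third node gets a lower max server memory setting.
    DECLARE @maxMemoryMB int = CASE WHEN @node = N'SQLNODE3' THEN 16384 ELSE 57344 END;

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', @maxMemoryMB;
    RECONFIGURE;
END;
GO
-- Mark the procedure to run automatically each time the instance starts.
EXEC sp_procoption @ProcName = N'dbo.usp_SetMemoryForActiveNode',
                   @OptionName = 'startup', @OptionValue = 'on';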

Database Storage Protection: SAN and RAID

To eliminate single points of failure for database storage, we use a Fibre Channel SAN. (For an illustration of how the databases are connected to the SAN, see Figure 2.) The SAN configuration has numerous redundant features to prevent a single point of failure, from the servers down to disk storage:

- All SQL Server database servers have two HBAs; each is connected to a different fiber switch.
- Each fiber switch then has a path to each loop of the SAN.
- Two fiber loops connect the disk enclosures.
- Each SAN has multiple storage processors to process the data; these are balanced so that each of the SAN's storage processors has approximately 50 percent of the load.
- The SAN backplane is fully redundant.
- Caching is implemented on the SAN to improve throughput, and the cache is fully mirrored.
- The SANs have additional batteries to back up power supplies. If a battery is not fully charged or if the SAN begins to run on battery power, the SAN cache is flushed to disk and disabled until both batteries are fully charged.
- Most LUNs are configured with RAID 10, and enough spindles are allocated for more than the maximum current throughput.

Web Server Protection: NLB Clustering

To prevent single points of failure within Web servers, we place all Web servers in a load-balanced Web farm using Microsoft Network Load Balancing (NLB):

- The Web server session state is stored in a database to prevent issues with load balancing.
- NLB affinity is set to None, allowing each request to go back to a different Web server.
- Web content is replicated between servers and sites using Windows Distributed File System Replication (DFSR).
- Each Web server has “Web gardens”, that is, multiple processes and application pools that isolate code for reliability.


- We have converted most Web servers to virtual servers using the Hyper-V™ technology.
- We have found that for configuring virtual servers, performance is better when the virtual guest machine VHD files reside on a physical disk volume that is separate from those used for the host operating system. This observation has led us to host our Web server VHDs on a local mirrored RAID disk volume that is separate from the operating system disk volume.

In our Web farm, several servers can fail or be removed with little or no impact. As a result, upgrades can be applied to one Web server at a time:

1. The Web server is removed from the Web farm.
2. Code is applied.
3. Testing is performed on the Web server.
4. The server is placed back in the Web farm.

Disaster Recovery

To protect against the potential loss of a primary data center, we located a second standby data center in a different geographical location. The standby data center serves as the disaster recovery (DR) site should a natural disaster or other disruptive event result in the primary data center becoming nonoperational. The standby data center is used only in the case of emergencies, when the primary data center is unavailable. When the primary data center becomes available again, we reestablish it as the primary data center and the standby data center takes on its role of protecting the primary data center.

Data from the primary data center is sent to the standby data center in near real time, and the standby data center is a functional duplicate of the primary data center's hardware, software, and infrastructure. In the event of the loss of the primary data center, the standby data center can be brought online almost immediately, with minimal disruption of customer activity. The following sections detail our disaster recovery strategies.

Data Center Selection

We required that the primary and standby data center sites be located sufficiently far apart from each other that a natural disaster affecting one would not likely affect the other. We also required an Internet Service Provider (ISP) for our data centers that could provide direct connectivity between the qualified cities, and could provide minimal network latency and high availability of each data center to the Internet. Finally, we wanted the primary and standby data center cities to be located near each other and near the corporate headquarters, to limit airfare costs and flight time.

Using these guidelines, we chose to locate our primary data center near our headquarters in Memphis, Tennessee, and selected Atlanta, Georgia, for our standby data center. We chose an ISP that provided direct connectivity to the cities of Memphis and Atlanta and could deliver the high throughput and low latency required to replicate or mirror data between the sites:

- Both data centers are situated at the junction of a dual fiber SONET ring. The ring is autorouting, providing protection against connectivity failure: multiple cuts in the fiber will not affect Internet service. We have not encountered an unplanned loss of connectivity in nine years.
- The ISP has a Point of Presence (POP) in each city.
- Network latency between the two cities is low (10 milliseconds or less), which is a direct result of the fact that the ISP has a POP in both cities.
- The ISP provides both sites with a 30-megabits-per-second (Mbps) connection to the Internet.

PCI Level 1 compliance requires that a disaster recovery site be PCI compliant before it can be used as a failover site. Because we have PCI Compliance as a company goal, we ensure that the standby data center meets the same PCI requirements as the primary data center.


Data Center Infrastructure

Each data center has a distinct production network. Both sites are independent of each other with a fully functional infrastructure, from the firewalls through the Web servers and database servers. A high-level view of the data center infrastructure is shown in Figure 3.

Figure 3: Each data center contains redundant hardware and multiple connectivity paths

We use redundant hardware and multiple connectivity paths to eliminate single points of failure within each data center. Figure 3 shows some, but not all, of the efforts that we have made to remove single points of failure:

- Active/Passive firewalls exist between each network.
- An NLB cluster balances incoming traffic across the Web farm.
- Multiple Web servers host the application code.
- At least two Domain Name System (DNS) servers exist at each data center. Each Web server uses a DNS alias for the server name when connecting to the SQL Server 2008 instance.
- The SQL Server 2008 instances are clustered using Windows Server 2008 failover clustering. If one of the nodes fails, the SQL Server 2008 instance will fail over to another node in the cluster.
- Database data and the Windows failover cluster quorum disk resource are stored on a SAN. The failover cluster has duplicate paths to the SAN, and the SAN LUNs are provisioned using RAID striping and mirroring.

In addition:

- The database servers and SAN have redundant power supplies with uninterruptible power supplies (UPSs).
- Each data center has its own local power generator to protect against temporary loss of power from the electrical grid.

For more information about the data centers, see Appendix A, "Data Center Infrastructure Details".

Disaster Recovery Solution: Database Mirroring

In our environment, database server disaster recovery is accomplished by means of SQL Server 2008 database mirroring between the primary and standby data centers. (SQL Server instance high availability is accomplished by using Windows Server failover clustering on each database server, and was covered in the previous section.) Figure 4 shows a logical view of the database mirroring between the data centers.


Figure 4: ServiceU implements disaster recovery between data centers by using asynchronous database mirroring

Database mirroring ensures a near real-time copy of all mission-critical database data at the standby data center:

- The principal databases are located in the primary data center in Memphis, and the mirror databases are in the standby data center in Atlanta.
- Low latency between the data centers benefits database mirroring in two major ways:
  - The principal server can send a large volume of transaction log records to the mirror quickly.
  - The time required to ship transaction logs to the mirror server is decreased when initializing database mirroring.

We chose asynchronous database mirroring because we do not want bandwidth-related delays of synchronous database mirroring to affect application performance. In asynchronous database mirroring, the principal server does not wait for acknowledgement from the mirror server before committing transactions. Therefore, delays in sending transaction log records to the mirror databases will not affect the completion of user transactions.

Because we have chosen asynchronous database mirroring, in the event of the loss of the primary data center, some unsent transactions may not be present on the mirror database. If that happens, we will retrieve unsent data from the old primary data center at Memphis if the data can be recovered from it when the Memphis data center databases come back online. The previously unsent data can be loaded into the standby (Atlanta) data center databases from the Memphis databases without any primary key conflicts. This is possible because care was taken in assigning the keys in Atlanta during the period it assumed the production server role. After the failover, when the Atlanta data center assumes the production role, we use a script to skip a generous range of keys that may have been used by the transactions whose log records were unsent from the primary Memphis data center at the time of the failure (for more information, see the section "Identity Column Increment" later in this paper). A gap in the keys' sequencing will exist between the last used at Memphis and the new ones assigned at Atlanta, but this is acceptable to the application. If the databases at the primary data center cannot be recovered, the unsent data will be lost.

There are nearly 30 databases on the main database instance, some of them interrelated, and all are mirrored to the standby data center. We base our database mirroring configuration on extensive testing before deployment.

Configuring the Mirror Server

For logins and SQL Server Agent jobs, we use scripts to keep the mirror SQL Server 2008 instance as current as possible with the principal instance:

- New logins are created at both data centers using scripts.
- SQL Server Agent jobs are scripted and applied to both the principal and mirror servers.

We keep all SQL Server Agent jobs active on both the principal and mirror server. Most job steps (all those that apply to databases) begin with the following Transact-SQL command:

IF dbo.cf_IsMirrorInstance() = 1 RETURN


The user-defined function cf_IsMirrorInstance() accesses the sys.database_mirroring Dynamic Management View (DMV) and returns a value of 1 when executed on the mirror instance. As a result, the SQL Server Agent jobs on the mirror instance that reference mirrored databases can remain active. They will succeed but not do anything while their server remains the mirrored server. (See Appendix D, "Scripts", for the cf_IsMirrorInstance() source code.)
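The actual function source is in Appendix D and is not reproduced in this section. A minimal sketch of a function with this behavior, written against sys.database_mirroring, might look like the following (the implementation details are assumptions, not ServiceU's exact code):

CREATE FUNCTION dbo.cf_IsMirrorInstance()
RETURNS bit
AS
BEGIN
    DECLARE @isMirror bit = 0;

    -- mirroring_role = 2 means this instance currently holds the mirror copy.
    IF EXISTS (SELECT 1
               FROM sys.database_mirroring
               WHERE mirroring_guid IS NOT NULL
                 AND mirroring_role = 2)
        SET @isMirror = 1;

    RETURN @isMirror;
END;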

New or changed database permissions are also scripted, and scripts are kept in a secured location. If a failover to the standby data center occurs, these scripts are run on the standby data center's SQL Server instance after the databases have been recovered.

Using Log Shipping to Help Set Up Database Mirroring

Our database administrators have found that using SQL Server log shipping to assist in setting up database mirroring enables them to deploy database mirroring at their convenience. We use log shipping to automatically transfer the required databases and transaction logs. After log shipping is running, database mirroring can be initialized at a later time. DBAs do not have to initialize database mirroring immediately after transferring and applying the database and transaction log backups to the standby data center.

We use log shipping to help set up mirroring in several contexts, including setting up mirroring from the primary to the standby data center, as well as within the primary data center when upgrading SQL Server. The following steps show how we use log shipping to help set up database mirroring from the primary to the standby data center:

1. Disable the database transaction log backup jobs on the primary data center SQL Server 2008 instance. We back up database transaction logs hourly using a SQL Server Agent job. If an hourly transaction log backup job were to run at the same time as the log shipping transaction log backup job, the two backups would cause a break in log shipping's transaction log chain and log shipping would fail. Therefore, we disable the hourly maintenance SQL Server Agent backup job. We schedule the log shipping transaction log backup jobs with a frequency greater than once per hour, usually the default of every 15 minutes, ensuring that the log shipping transaction log backup job provides at least the same or better level of protection as the hourly transaction log backup jobs.

2. Set up log shipping for each database at the primary data center to the standby data center. Each database at the primary data center will be a log shipping primary, with a corresponding log shipping secondary database at the standby data center. When setting up log shipping between SQL Server 2008 instances, we use backup compression. Database and log backups and restores are faster when compressed, and because the resulting backup files are smaller, copying them from the log shipping primary to the log shipping secondary takes less time.

3. After the log shipping jobs are restoring transaction log backups on a regular basis, we wait for an appropriate low-traffic time to switch to database mirroring. This helps reduce any delays that database activity might cause to database mirroring initialization.

4. Disable the log shipping transaction log backup job on the primary data center's SQL Server 2008 instance (the log shipping primary's LS_Backup job).

5. At the standby data center, run the LS_Copy and LS_Restore jobs until all transaction log backups from the log shipping primary databases have been copied and restored.

6. Disable the log shipping jobs on the standby data center's SQL Server 2008 instance.

7. Set up database mirroring for each database from the primary to the standby data center. Because the databases have been kept current as of the last transaction log backup by log shipping, this process simply involves stepping through the Database Mirroring Wizard (or running a script; a sketch of the equivalent statements follows this list).

8. Remove log shipping for each database.

9. Enable the hourly transaction log backup jobs on the principal server.
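For reference, the statements behind the wizard follow the usual mirroring pattern. The sketch below uses a hypothetical database name, host names, and port, and assumes the mirroring endpoints already exist on both instances and that the mirror copy was left in a restoring (WITH NORECOVERY) state by log shipping:

-- Run on the standby (mirror) instance first:
ALTER DATABASE EventsDB SET PARTNER = N'TCP://primarysql.example.local:5022';

-- Then run on the primary (principal) instance:
ALTER DATABASE EventsDB SET PARTNER = N'TCP://standbysql.example.local:5022';
ALTER DATABASE EventsDB SET SAFETY OFF;   -- asynchronous (high-performance) mode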

For more information about using database mirroring with log shipping, see the Microsoft white paper Database Mirroring and Log Shipping Working Together (http://sqlcat.com/whitepapers/archive/2008/01/21/database-mirroring-and-log-shipping-working-together.aspx).

Failing Over to the Standby Data Center

Failover to the standby data center addresses unplanned downtime in the event of the complete loss of the primary data center. The standby data center is normally unused; no traffic is directed to it except when the primary data center is unavailable. We monitor the standby data center continuously. Once per year, a more labor-intensive and thorough failover test is performed. This controlled failover is limited to one or two databases that do not have dependencies on other databases.

We use a Domain Name System (DNS)-based approach to direct customers to the appropriate data center. If the primary data center is lost, this DNS-based solution directs our customers to the standby data center.

Within each data center, we use aliases and internal DNS servers to direct Web servers' database connections to the appropriate SQL Server instance:

- Each data center has at least two DNS servers.
- The Web servers connect to the correct clustered SQL Server 2008 instance using a combination of a DNS alias (a CNAME record) and instance name, rather than the actual server name and instance name.

When the primary data center is active, DNS servers at both data centers direct the Web servers to connect to the primary data center's SQL Server 2008 instance.

We automatically and continuously monitor the application servers at the standby data center with functional tests and security scans:

- The application/Web servers at the standby data center are continuously tested to assure that the Web servers are functioning correctly and are ready to be used.
- We also run vulnerability scans for PCI Compliance at the standby data center to ensure it is always PCI compliant.

In the event of a disaster at the primary data center, we will take the following steps to redirect Web servers to the correct SQL Server database server. The following list includes only the major steps and is not exhaustive:

1. The primary data center is confirmed to be down and the decision is made to fail over to the standby data center.

2. We use our DNS-based solution to direct customers to the appropriate IP address of the standby data center.

3. We then use scripts to perform the following operations, including:
   - Recover the standby SQL Server 2008 databases (a minimal sketch of this step follows the list).
   - Apply correct user permissions to the SQL Server 2008 databases.
   - Increment the identity columns at the standby data center's SQL Server database tables (see "Identity Column Increment" later in this paper).
   - Change the standby data center's internal DNS alias to the standby data center's database server name.
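As a hedged illustration of the database recovery step above (the database name is hypothetical), forcing service on a mirror database when its principal is unreachable is a single statement per database; because it allows data loss, any unsent transactions must later be retrieved from the old principal as described in the next section:

-- Run on the standby (mirror) instance for each mirrored database
-- once the primary data center is confirmed to be down.
ALTER DATABASE EventsDB SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS;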

Bringing the Primary Data Center Back Online Following a Disaster

As noted previously, when the primary data center initially becomes available following a disaster, we will first find and retrieve any unsent data in the primary data center's database tables, and then load that into the appropriate databases at the standby data center. Because the identity columns of the tables at the standby data center will have used values from a higher range, the unsent primary data center data can be loaded directly, keeping all identity values intact. If the primary data center's data cannot be accessed because of damage due to the disaster, that unsent data will be lost.

The standby data center is not meant as a permanent substitute for the primary data center, but as a backup in case of emergencies. When the primary data center comes back online, we will perform the previous steps to reverse the roles so that the Memphis data center becomes the primary data center, and then we will reestablish the Atlanta data center in the role of standby.

We will use log shipping from the standby data center to the primary data center in order to prepare the databases for database mirroring. After database mirroring is operational from the standby data center to the primary, during a low-traffic period the direction of mirroring will be reversed, and all DNS aliases adjusted so that the primary data center again assumes its original role.

Monitoring

We have implemented appropriate monitoring to send alerts quickly when potential problems are detected.

Network and Web Servers

We continuously monitor system performance, with most concern devoted to performance during peak periods. Such traffic patterns occur during a five-to-seven-hour window. Rather than judge average performance over 24 hours, we compute averages based on values from the peak periods, and focus on the short bursts of peak activity (3-10 minutes) during that window. Whenever a server begins to run at 60 percent of load capacity during peak periods, we begin the process of planning the upgrade of that equipment.

We monitor the health of each data center from three locations: the primary data center, the standby data center, and the corporate offices, receiving alerts via e-mail and/or text messages.

General server monitoring includes:

- General server availability: whether the server is running or not.
- Cluster service per node and for the active node for SQL Server services.
- The ability to connect remotely to the event log of each server.
- Excess server memory usage.
- CPU usage above a certain threshold for all processors.
- Hard disk space usage on all local and SAN drives.
- Active Directory® servers via a simple Lightweight Directory Access Protocol (LDAP) query.

Web servers and services monitoring includes:

- Microsoft Internet Information Services (IIS) Admin, WWW Publishing, custom encryption application service, and specific Web sites.
- Monitoring Services (additional services that check on the health of Web sites).
- Response time and content of Web pages.

Monitoring Database Mirroring

We monitor database mirroring by using the SQL Server 2008 Database Mirroring Monitor Job. We use scripts to set up monitoring for a selected pair of counters, one each at the principal and mirror instances. For all mirrored databases, we track the following two counters and set thresholds for their alerts. The resulting SQL Server Agent jobs run once per minute:


- Age of oldest unsent transaction (set up at the principal server): reports the age in minutes of the oldest unsent transaction in the send queue at the principal server. We set an alert at three minutes for each mirrored database.
  - We run an initial script on the principal that looks for any databases participating in database mirroring, and sets a baseline value for the "Age of oldest unsent transaction" for each database. All mirrored databases initially get the same setting.
  - We run a second script, also on the principal server, that adjusts the threshold value for any databases which may need a different value. We may assign differing threshold values for databases based on varying patterns of update activity.
  - We use this counter to monitor potential data loss in the event of an unplanned loss of the primary data center. The "Age of oldest unsent transaction" counter helps us ensure that it stays within its recovery point objective (RPO) of three minutes (see "Availability Goals" earlier in this paper).
- Unrestored log threshold (set up at the mirror server): helps estimate how long it would take the mirror server to roll forward the log records remaining in its redo queue. We send an alert if the redo queue exceeds a certain threshold, usually between 250 kilobytes (KB) and 500 KB. The actual value may change for each database depending upon the database's workload and behavior patterns.
  - We run an initial script on the mirror that looks for any databases participating in database mirroring. It sets a baseline log threshold value for each database. Each mirrored database gets the same initial setting.
  - We run a second script on the mirror server that adjusts the threshold value for any databases that may need a different value due to differing patterns of update activity (a sketch of these threshold calls follows this list).
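The scripts themselves are not reproduced in this paper; they are built on the database mirroring monitor procedures in msdb. A hedged sketch of the kinds of calls involved (the database name is hypothetical, and the thresholds shown are the example values from the list above):

USE msdb;
GO
-- Create the Database Mirroring Monitor job, refreshing mirroring status once per minute.
EXEC sys.sp_dbmmonitoraddmonitoring 1;

-- On the principal: alert when the oldest unsent transaction is older than 3 minutes (alert_id 1).
EXEC sys.sp_dbmmonitorchangealert N'EventsDB', 1, 3, 1;

-- On the mirror: alert when the unrestored (redo queue) log exceeds 500 KB (alert_id 3).
EXEC sys.sp_dbmmonitorchangealert N'EventsDB', 3, 500, 1;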

Because we use asynchronous mirroring, we do not monitor the "Mirror commit overhead" counter. In addition to the database mirroring monitoring counters, we also monitor the space used and available free space for log and data volumes at each server.

We use Windows Management Instrumentation (WMI) alerts to monitor lock escalation and deadlocks. To minimize lock escalation issues that occur during reporting, we are currently testing Read Committed Snapshot Isolation (RCSI).
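If the RCSI testing succeeds, enabling it is a single database-level change (the database name below is illustrative); it requires exclusive access to the database, so it would be applied during a maintenance window:

ALTER DATABASE EventsDB SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;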

Unplanned Downtime Scenarios

In the past, we have encountered a few events causing unplanned downtime, leading us to adopt the following strategies.

Suspended Mirroring

We have observed that under certain conditions, database mirroring sessions may enter a suspended state. When a database mirroring session is suspended, the principal database's transaction log records cannot be sent to the mirror. Because transaction log backups on the principal will no longer truncate the transaction logs if transaction log records cannot be sent to the mirror, the log files grow. We have established SQL Server Agent jobs that monitor and alert if any database participating in mirroring is in a state other than SYNCHRONIZING or SYNCHRONIZED (see the query in Appendix D). After the underlying issue has been addressed, the mirroring session can easily be resumed using a script or SQL Server Management Studio.
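The exact monitoring query is in Appendix D; a query of this general shape against the sys.database_mirroring catalog view returns any mirrored database that is not in a healthy state:

SELECT DB_NAME(database_id) AS database_name,
       mirroring_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL
  AND mirroring_state_desc NOT IN (N'SYNCHRONIZING', N'SYNCHRONIZED');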

Identity Column Increment

We use asynchronous database mirroring. In the case of a disaster involving the loss of connectivity to the primary data center, some log records from the principal databases may be prevented from being applied to the mirrored databases. This could result in inconsistencies between databases that are interrelated from an application perspective. The missing data still exists in the primary data center in Memphis, but it may be days or weeks before it can be applied to the standby data center in Atlanta.

Our databases have many tables that include identity columns. Because the databases are interrelated, one database may refer to identity values in another database. After a failover to the standby data center due to a disaster, unapplied log records that can no longer be sent from the principal could mean that a database may refer to an identity value that does not exist in the table of another database. We consider this extremely significant and have developed a methodology to prevent data integrity issues that could be caused by using the identity values that have not been sent or applied.

During recovery of the mirrored databases at the failed-over site, a script runs on every table in every database, reseeding the identity value to increment it by a certain number, and logging the change with the new and highest old values. When the primary data center comes online again, we can query the former principal database server's data (assuming it is readable) and retrieve the appropriate rows to populate the missing values, bringing the tables into consistency across all the databases on the new principal.
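The production script also records the old and new seed values; a simplified sketch of the reseeding logic for a single database (the gap size is illustrative, chosen to comfortably exceed any unsent transactions) might look like this:

-- For every table with an identity column, jump the current identity value
-- forward by a generous gap so rows created at the standby site cannot
-- collide with unsent rows from the old principal.
DECLARE @gap bigint = 100000;   -- illustrative gap size
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql
    + N'DBCC CHECKIDENT(''' + QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(t.name) + N''''
    + N', RESEED, '
    + CONVERT(nvarchar(20), CAST(IDENT_CURRENT(SCHEMA_NAME(t.schema_id) + N'.' + t.name) AS bigint) + @gap)
    + N');' + CHAR(13) + CHAR(10)
FROM sys.tables AS t
WHERE EXISTS (SELECT 1 FROM sys.identity_columns AS ic WHERE ic.object_id = t.object_id);

EXEC sys.sp_executesql @sql;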

Planned Downtime Scenarios

Planned downtime arises from activities such as software upgrades, patches and cumulative updates, and database and application changes. We have adopted the following strategies to keep this downtime to a minimum.

Steps for Upgrading to SQL Server 2008 from SQL Server 2005

When we decided to upgrade from SQL Server 2005 to SQL Server 2008, we also decided to upgrade from Windows Server 2003 to Windows Server 2008. After extensive planning, we accomplished the upgrade with minimal downtime.

When upgrading from Windows Server 2003 to Windows Server 2008, we decided also to reformat the storage LUNs for the Windows Server 2003 failover cluster at the primary data center, and to upgrade to new database servers at the standby data center. As a result, we chose to rebuild the failover cluster at the primary data center, and build a new failover cluster at the standby data center. We built a temporary SQL Server 2008 clustered instance on spare servers and used it to keep the SQL Server databases available while the primary data center's failover cluster was rebuilt.

This section lists the steps we took to perform the upgrade, showing how we were able to preserve high availability while minimizing user downtime. In these steps, the following abbreviations will be used. Each of the following SQL Server instances is clustered:

- primarySQL2005: the legacy SQL Server 2005 instance at the primary data center
- standbySQL2005: the legacy SQL Server 2005 instance at the standby data center
- tempSQL2008: a temporary SQL Server 2008 instance at the primary data center
- primarySQL2008: the new SQL Server 2008 instance at the primary data center
- standbySQL2008: the new SQL Server 2008 instance at the standby data center

The following steps illustrate the process our team used to upgrade to SQL Server 2008.

Phase 1: Redirected Application Users to a Temporary SQL Server 2008 Instance

1. Configured a temporary two-node SQL Server 2008 cluster.
   - For the temporary SQL Server 2008 cluster, called tempSQL2008, only two cluster nodes were used. The instance would only be online for a few off-peak hours.
   - The servers for tempSQL2008 were configured with Windows Server 2008, clustered, and a clustered instance of SQL Server 2008 installed.
   - For tempSQL2008 data storage, we added Disk Array Enclosures (DAEs) with additional disk drives to the existing EMC CX-Series array.
   - The tempSQL2008 server used a Fibre Channel path via the same equipment as the production SQL Server 2005 clustered instance, going through the same fiber optic switches.
   - The tempSQL2008 server-level settings were configured and thoroughly tested.

2. Stopped database mirroring to the standby data center.

3. Set up log shipping from primarySQL2005 to tempSQL2008.
   - We use log shipping to help prepare databases for database mirroring. (For more information, see "Using Log Shipping to Help Set Up Database Mirroring" earlier in this paper.)
   - We could not use backup compression to assist in setting up log shipping in this step, because backup compression is only available between SQL Server 2008 instances, and the primarySQL2005 instance was running SQL Server 2005.

4. Initialized asynchronous database mirroring from primarySQL2005 to tempSQL2008.
   - Accomplished by converting from log shipping to database mirroring.

5. Waited for a very low-traffic period before beginning the upgrade process.

6. Converted all database mirroring sessions to synchronous database mirroring.
   - Waited for synchronization to occur.

7. Used the firewall to redirect all incoming traffic to a “scheduled downtime” Web site.
   - All Web servers have the same configuration, and each hosts a Web site for the purpose of handling downtime messages. This Web site responds appropriately to Web service requests.
   - Application downtime now starts.

8. Removed all Web servers from the Web farm except one.
   - This Web server continued to serve the "scheduled downtime" Web site. Because this Web server was not immediately rebooted, it is temporarily called the StaleWebServer in these steps.

9. Rebooted all the remaining Web servers to remove any cached or pooled connections.

10. Simultaneously changed the DNS connection alias and reversed the database mirroring roles.
   - Changed the DNS connection alias to redirect the application to the tempSQL2008 instance. For details about how we use DNS connection aliases, see "Data Center Infrastructure" earlier in this paper.
   - At the same time, we ran a SQL script to manually fail over database mirroring, reversing the database mirroring roles and making tempSQL2008 the principal for all database mirroring sessions (a sketch of these mirroring commands follows this phase's steps).

11. Removed database mirroring.
   - We ran a script to remove all database mirroring sessions because tempSQL2008 could not mirror to primarySQL2005. (A SQL Server 2008 instance cannot mirror to a SQL Server 2005 instance.)

12. Tested all systems with one of the rebooted Web servers.
   - We now took one of the rebooted Web servers currently outside the Web farm (call it the TestWebServer), and used it for testing the application that now connects to the tempSQL2008 database server. This was the final test to ensure that all application functionality was present when connecting to the new tempSQL2008 database. If the testing failed, we could have reverted back to the SQL Server 2005 instance (by issuing a RESTORE command on each of the SQL Server 2005 databases on primarySQL2005 to bring each database from a loading, nonrecovered state into a recovered state).
   - This was effectively the Go/No-go decision point. After we made the decision to allow users back into the application and to connect to tempSQL2008, user updates to the databases would start. After that point, new data in the SQL Server 2008 database would be lost if we decided to roll back to SQL Server 2005 based on restoring from database backups.
   - Because we made the decision to proceed, we put the remaining rebooted Web servers back into the Web farm. We performed the following two actions simultaneously and as quickly as possible:
     - Placed TestWebServer into the Web farm, making it an active Web server.
     - Removed StaleWebServer from the Web farm, rebooted it in order to remove any cached or pooled connections, and placed it back in the Web farm.
   - All the Web servers were now active, in the Web farm, and ready to connect to tempSQL2008.

13. Redirected traffic (via the firewall) back to the application IP addresses.
   - The Web servers now were connecting to tempSQL2008. At this point the system was back up, and users were now able to use the application.
   - This first downtime period lasted approximately 10 minutes.
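Steps 6 and 10 above use standard database mirroring statements; a hedged sketch for a single database (the name is hypothetical) looks like the following:

-- Step 6, run on the principal (primarySQL2005): switch the session to
-- synchronous (high-safety) mode and wait for it to reach SYNCHRONIZED.
ALTER DATABASE EventsDB SET SAFETY FULL;

-- Step 10, run on the principal once the database is SYNCHRONIZED:
-- manual failover makes tempSQL2008 the new principal.
ALTER DATABASE EventsDB SET PARTNER FAILOVER;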

Phase 2: Redirected Application Users to the Permanent SQL Server 2008 Instance at the Primary Data Center

1. Built a new SQL Server 2008 cluster (primarySQL2008) at the primary data center.
   - Reconfigured the original primarySQL2005 servers with Windows Server 2008 and SQL Server 2008, applying the appropriate drivers and critical updates. Other IT personnel continued to monitor and test tempSQL2008, currently the production instance.
   - Reconfigured the primarySQL2005 server's LUNs on the SAN and reformatted them using Windows Server 2008. We reconfigured the LUNs because we changed the number of disks from the older Windows Server 2003 configuration. If reconfiguration had not been required, just a Quick Format using Windows Server 2008 to clean up the drives and maintain proper LUN disk partition alignment would have been sufficient.
   - Created the new Windows Server 2008 cluster as a three-node cluster (using an integrated install), and then installed SQL Server 2008 Enterprise. We added the first SQL Server node using the SQL Server Setup program interactively, and added the other SQL Server nodes using Setup's command-line installation options. We found this faster than using Setup interactively for all nodes. We then configured the SQL Server settings and tested a variety of failover situations to make sure everything was functioning correctly.

2. Set up log shipping from tempSQL2008 to primarySQL2008.
   - We were able to use backup compression when setting up log shipping between these two SQL Server 2008 instances, making the log shipping setup process faster when compared with the previous setup of log shipping from primarySQL2005 to tempSQL2008.

3. Initialized asynchronous database mirroring from tempSQL2008 to primarySQL2008.
   - Accomplished by using log shipping to initialize database mirroring.
   - Converted the mirroring sessions to synchronous database mirroring, and waited for all mirror databases to synchronize.
   - At this point, we were ready to move the application to the primarySQL2008 instance, but it required a second downtime period.

4. Used the firewall to redirect all incoming traffic to the “scheduled downtime” Web page.
   - Users were effectively offline again at this point. The second downtime period starts.

5. Simultaneously changed the DNS connection alias and reversed the database mirroring roles.
   - We changed the DNS connection alias at the data center DNS servers to point connections to the primarySQL2008 server.
   - At the same time, we ran a script to reverse the database mirroring roles, making primarySQL2008 the principal and tempSQL2008 the mirror for all database mirroring sessions.
   - We then repeated the processes for testing the application, as well as rebooting all Web servers, as outlined in step 12 of Phase 1.

6. Redirected traffic (via the firewall) back to the application IP addresses.
   - Users could now access the application, and the Web servers were connecting to the primarySQL2008 instance.
   - Downtime duration for this second phase was about six minutes.
   - At this point, the major part of the upgrade process was finished and the application was now using the desired primarySQL2008 instance. The following steps in Phase 3 did not need to occur immediately, and no user downtime was required.

Phase 3: Prepared a New SQL Server 2008 Instance at the Standby Data Center and Set Up Database Mirroring to It from the Primary Data Center

1. Prepared the standby data center SQL Server 2008 instance (standbySQL2008).
   - We left mirroring from primarySQL2008 to tempSQL2008 active temporarily, in case any issues arose with the primarySQL2008 cluster.
   - We then replaced the standbySQL2005 cluster with new servers, installing Windows Server 2008 and SQL Server 2008, as well as upgrading to a new SAN. This was part of a planned equipment upgrade process.

2. Set up log shipping from primarySQL2008 to standbySQL2008.
   - We again were able to use backup compression to improve the speed of the log shipping setup process.

3. Established database mirroring to the standby data center.
   - Removed database mirroring from the primarySQL2008 to tempSQL2008 instances.
   - Removed log shipping and set up asynchronous database mirroring from primarySQL2008 to standbySQL2008.
   - At this point, both data centers were live with SQL Server 2008 and the upgrade process was complete.

For several weeks after the upgrade, we left the databases in SQL Server compatibility mode 90. This allowed us to troubleshoot potential database issues without the additional concern of having changed to the new SQL Server 2008 compatibility level as a factor in troubleshooting. After no issues were found, we changed the compatibility level of the databases to 100.
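The compatibility level change itself is a one-line statement per database (the database name is illustrative):

ALTER DATABASE EventsDB SET COMPATIBILITY_LEVEL = 100;   -- SQL Server 2008 behavior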

Patches and Cumulative Updates

We apply Windows and SQL Server patches to the mirror instance first, before the principal, and always during off-peak hours.

For SQL Server patches (hotfixes or cumulative updates), we use the following process:

1. Start at the standby data center with the failover cluster hosting the mirror SQL Server 2008 instance.

2. Assume the cluster nodes are named Node 1, Node 2, and Node 3. Also assume that SQL Server is currently running on Node 1, and that Node 1 and Node 2 are the preferred owners.

3. Run the patch installation on a node other than Node 1, for example, Node 2. When the installation is finished, reboot Node 2. Though perhaps not necessary, this will remove any pending reboot requirements on the server.


4. Run the patch installation on the other unused node, Node 3, and when finished, reboot Node 3.

5. Move the SQL Server 2008 resource group from Node 1 to the other preferred node (Node 2). This normally takes 30-60 seconds, and is the only downtime in this process.

6. Run the patch installation on Node 1, and when finished, reboot the node.

7. Verify that the SQL Server instance has the correct version number for the patch by running SELECT @@VERSION on the SQL Server 2008 instance.

8. Repeat steps 2-7 for the principal SQL Server 2008 instance (the failover cluster at the primary data center).

For Windows Server 2008 updates (including patches, drivers, and other software updates), we use the following steps:

1. Start at the standby data center, on the mirror instance failover cluster.

2. Again, assume the cluster nodes are named Node 1, Node 2, and Node 3, and that the SQL Server 2008 instance is running on Node 1.

3. Pause an inactive node, for example, Node 2.
4. Install any updates and make any required changes.
5. Reboot the node.
6. Resume the node.
7. Repeat steps 2-6 for Node 3.
8. Move the SQL Server 2008 resource group from Node 1 to Node 2.
9. Repeat steps 2-6 for Node 1.

10. Repeat steps 2-9 on the principal instance failover cluster at the primary data center.

Database and Application Changes

We have a number of procedures in place to handle planned downtime resulting from database and application changes.

To determine database schema and other major changes to the databases, we use a third-party database comparison utility to compare the production and final version of the development databases. The comparison utility generates a Transact-SQL script that changes the production database schema to the target database schema.

After the script is generated, we inspect it:

- We ensure that the deployment script changes are correct, and we drill down into the details of the changes.
- If there is any potential for a table lock (due to a table schema change, for example), we run the deployment script on a development system to determine the effects of the changes.

A common scenario for a schema change is adding a new column to a table (a sketch follows). When creating the table, we ensure that the CREATE TABLE command includes default values for the new column. We also ensure that the final change script modifies any affected stored procedures and views to reflect the new column. If a stored procedure is changed to reflect the new column in an input parameter, we also initialize the default value of the parameter.
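A minimal sketch of that pattern. The table, column, constraint, and procedure names (Orders, IsRefundable, usp_AddOrder) are hypothetical rather than ServiceU's actual objects, and the ALTER TABLE form stands in for whatever statement the comparison utility generates:

-- Add the new column with a default so existing code that does not reference it keeps working.
ALTER TABLE dbo.Orders
    ADD IsRefundable bit NOT NULL
        CONSTRAINT DF_Orders_IsRefundable DEFAULT (0);
GO

-- Update an affected stored procedure; the new parameter gets a default value
-- so existing callers that do not pass it are unaffected.
ALTER PROCEDURE dbo.usp_AddOrder
    @CustomerID   int,
    @Amount       money,
    @IsRefundable bit = 0
AS
BEGIN
    INSERT INTO dbo.Orders (CustomerID, Amount, IsRefundable)
    VALUES (@CustomerID, @Amount, @IsRefundable);
END;
GO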

When a database schema change requires downtime, and we are ready to apply the script, we take the following steps:

1. Choose an off-peak time.
2. Before applying changes, back up all database transaction logs as simultaneously as possible (a sketch follows this list). Because some databases are interrelated, this makes the backup image of all of them as consistent as possible.
3. Apply the changes. If the estimated downtime is less than 60 seconds, simply apply the changes without stopping applications from connecting to the SQL Server instance; some users may see an error message. If the estimated downtime is greater than 60 seconds, redirect the applications to a friendly downtime message until the changes are complete.
4. Confirm the functionality of the database changes. The type of confirmation depends on how significant the database changes are.
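A minimal sketch of step 2, assuming two interrelated databases named EventsDB and PaymentsDB and a local backup path; both names and the path are placeholders. Running the statements back to back in one batch keeps the log backups as close together in time as possible:

-- Build a shared timestamp so the files sort together, then back up the logs in quick succession.
DECLARE @stamp varchar(20) =
    REPLACE(REPLACE(REPLACE(CONVERT(varchar(16), GETDATE(), 120), '-', ''), ' ', '_'), ':', '');
DECLARE @events_file   nvarchar(260) = N'E:\Backup\EventsDB_log_'   + @stamp + N'.trn';
DECLARE @payments_file nvarchar(260) = N'E:\Backup\PaymentsDB_log_' + @stamp + N'.trn';

BACKUP LOG EventsDB   TO DISK = @events_file   WITH COMPRESSION, CHECKSUM;
BACKUP LOG PaymentsDB TO DISK = @payments_file WITH COMPRESSION, CHECKSUM;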

Most of our database changes occur in the context of a change to the application. In such cases, running scripts to change the databases must be coordinated with changes to the application. Generally, updates to the application can be done in a matter of seconds.

In general, we have two strategies for deploying changes to Web servers. In our application, Web content is replicated using Windows Distributed File System (DFS), so the deployment strategy changes depending on whether there are changes to Web content or not. We perform the following steps using a team of people:

- When there are no changes to Web content, we remove all Web servers except one from the Web farm. Then we simultaneously deploy the code to one of the removed servers and to SQL Server. After testing, we swap these two Web servers by putting the one with the changed code into the farm and removing the existing one. Then we deploy the new code to all the remaining servers and put them back into the farm.
- The case is more difficult when the Web content is replicated. In this case we go down to a single Web server, taking all of the other Web servers out of the Web farm. We apply the changes to that Web server and simultaneously apply the SQL Server changes. At this point, the Web applications and the SQL Server databases should work together. We ensure that the new Web content has replicated and then add the other servers back into the Web farm.

When only SQL Server changes are being deployed, we determine how long the SQL Server database changes will take. If only stored procedures or views will be changed, and there are no schema changes, the SQL Server changes typically finish in a matter of seconds. In those cases, it is not necessary to redirect users to a "scheduled downtime" Web site. If the SQL Server deployment is more time-consuming due to schema changes, we direct users to a "scheduled downtime" site, as illustrated in "Steps for Upgrading to SQL Server 2008 from SQL Server 2005" previously, until the changes are successfully deployed and verified.

Index Maintenance
We rebuild and reorganize indexes in a selective and balanced manner. Rebuilding or reorganizing indexes generates large amounts of transaction log records, and those log records must then be sent to the remote mirror. With asynchronous mirroring, such a condition can cause the mirror to fall significantly behind the principal.

As a result, we allow index maintenance only during low-traffic times. We have a periodic Transact-SQL job that reorganizes as well as rebuilds indexes. Both actions cause transaction log load that must be sent to the mirror. To reduce and even out this load, the script does not pick all tables and indexes at once, but spreads the task across multiple days and at low-usage times.

The script uses a threshold to determine whether to rebuild or reorganize an index, depending on the use of the table (lookup tables, for example, would not require frequent rebuilds) as well as fragmentation percentages. This is a multitenant system, and all customers have data in the same tables. Different usage patterns by different customers can cause some tables to require index rebuilding or reorganizing.

The script rebuilds indexes online whenever possible. Maintenance on tables that cannot be reindexed online is done only during the lowest-traffic times. A sketch of the threshold logic follows.
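The maintenance script itself is not reproduced in the paper. The following is only a minimal sketch of the kind of threshold-based decision it describes, using sys.dm_db_index_physical_stats; the 5 percent and 30 percent thresholds are the commonly cited guidance rather than necessarily the values used here, and ONLINE = ON additionally requires an edition and index types that support online rebuilds:

-- Generate a REORGANIZE or REBUILD command for each index above a fragmentation threshold.
SELECT 'ALTER INDEX ' + QUOTENAME(i.name) + ' ON '
       + QUOTENAME(s.name) + '.' + QUOTENAME(o.name)
       + CASE WHEN ps.avg_fragmentation_in_percent >= 30
              THEN ' REBUILD WITH (ONLINE = ON);'
              ELSE ' REORGANIZE;'
         END AS maintenance_command,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id
JOIN sys.objects AS o ON o.object_id = i.object_id
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
WHERE ps.avg_fragmentation_in_percent >= 5
  AND ps.index_id > 0          -- skip heaps
  AND o.is_ms_shipped = 0
ORDER BY ps.avg_fragmentation_in_percent DESC;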

Conclusion
ServiceU has successfully implemented a sophisticated high-availability and disaster-recovery solution for our applications. Database high availability within each data center is achieved by placing a clustered SQL Server 2008 instance on a three-node Windows Server 2008 failover cluster. A three-node failover cluster maintains high availability during cluster patches and upgrades. Disaster recovery is achieved using SQL Server 2008 database mirroring from a primary to a standby data center. Thoroughly tested procedures are used to maintain maximum availability with minimal downtime during both planned and unplanned downtime scenarios. The application is continuously tested and scanned for vulnerabilities at the standby data center, ensuring it is always ready to go into production should the primary data center become unavailable.

For more information, see the following documents:

- Database Mirroring and Log Shipping Working Together
- How to: Minimize Downtime for Mirrored Databases When Upgrading Server Instances
- Using Warning Thresholds and Alerts on Mirroring Performance Metrics

Appendix A. Data Center Infrastructure Details
ServiceU has taken a number of steps to ensure the infrastructure at each data center is fully redundant and secure.

Power
- All power is filtered to provide reliable current.
- Data centers have backup generators with large fuel tanks. Both have emergency contracts in case of a disaster to assure continued fuel.
- Batteries provide additional backup in case of any delays or problems with the generators.
- Power-switching equipment detects not only the availability of utility company power, but the quality of that power before it switches back from the generator.
- Multiple electrical circuits are used to further mitigate any power risk. Each device's power supply is connected to a different electrical circuit. Detailed power diagrams help ensure that mistakes are not made.
- All equipment, when available from the manufacturer, has redundant power supplies.

Air Conditioning
- Multiple air conditioning units provide redundant temperature and humidity control.
- Capacity is oversized so that even if up to 50 percent of the units are nonfunctional, the data center still maintains acceptable temperature and humidity.

Security
- Multiple badges, or badge plus code access, are required to enter the facility.
- Badge access and video logs are kept for a minimum of 90 days (a PCI Data Security Standard (DSS) requirement).
- Both data centers maintain PCI compliance 24x7x365. In this way, if we ever have to fail over to the standby data center, we do not have to make any changes to be PCI compliant. It is a PCI compliance requirement that a standby data center facility also be compliant before beginning to process any transactions.
- Key access is required to access the servers.
- Servers use a password-protected KVM.
- Servers lock after 15 minutes of inactivity.

Offsite Backups
- Databases and transaction logs are backed up to disk files. Those backup files are included in the daily tape backups that are transported offsite and stored in a climate-controlled vault. In the case of a disaster, those tapes are flown to the nearest major airport. They are then shipped overnight to the standby data center. While the tapes should never be needed, this is an additional layer of protection.


Satellite Phones
- Key members of the company carry satellite phones at all times in case of a disaster. This allows them to communicate with vendors, service providers, and other employees.
- Key service providers have the satellite phone numbers.

Firewalls
- We use firewalls in an active/passive high-availability configuration: if the active firewall fails, the passive firewall takes over with no loss of connectivity, and the client never realizes that there was a problem.
- The firewalls share state, so if one goes down or has to be rebooted, the user should not notice any packet loss or connectivity problems.

Switches
- All switches have very few moving parts (this is part of the company specification).
- All have redundant power supplies.

Backup
We have established the following backup procedures:

- Databases are backed up daily.
- Database transaction logs are backed up hourly during the active part of the day.
- Backups are made to disk and kept on disk for three days.
- Tape backups are stored off-site and on a regular rotation.
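A minimal sketch of the daily full backup as it might appear in a SQL Server Agent job step; the database name and path are placeholders, and WITH COMPRESSION reflects the backup compression used elsewhere in this paper:

-- Daily full database backup to disk, compressed and checksummed.
DECLARE @file nvarchar(260) =
    N'E:\Backup\EventsDB_full_' + CONVERT(char(8), GETDATE(), 112) + N'.bak';

BACKUP DATABASE EventsDB
    TO DISK = @file
    WITH COMPRESSION, CHECKSUM, INIT;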

Appendix B. Database Server Additional Information
ServiceU places database data and transaction log files on separate physical LUNs of the SAN, not on virtual LUNs. After extensive testing of our application's data access behavior, we found we could achieve slightly better overall performance by placing the tempdb data files and its log files on the database server's data LUN. Disk spindles previously used for a separate tempdb LUN were then reallocated to the database server's data LUN.

Additional SQL Server 2008 database server decisions we made include:

- We found that giving each database server 64 GB of RAM dramatically improved performance.
- Our application requires MSDTC for some middle-tier transactions. Microsoft does not support MSDTC with database mirroring, but we do not use MSDTC in a way that would lead to unsupported behavior.
- We have found that setting MAXDOP to 1 at the server level has been best for online transaction processing (OLTP) performance. On occasion we may set it to 0 temporarily for large data changes, such as rebuilding indexes, to take advantage of parallelism (a sketch follows this list).
- We use a Policy-Based Management (PBM) server to centralize policies. Policies are pushed out to the instances of SQL Server via the PBM server and the Windows PowerShell® command-line interface.
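A minimal sketch of changing the server-level MAXDOP setting with sp_configure; the surrounding operational scripts are not reproduced here:

-- MAXDOP is an advanced option, so enable advanced options first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Set max degree of parallelism to 1 for day-to-day OLTP work.
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;

-- Temporarily allow full parallelism for a large operation such as an index rebuild,
-- then set the value back to 1 afterward.
EXEC sp_configure 'max degree of parallelism', 0;
RECONFIGURE;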

Appendix C. Documentation Procedures


ServiceU has implemented documentation procedures that are crucial to both SLAs and high availability.

Disaster Recovery Process
- Our disaster recovery failover steps, including SQL scripts, processes, and other important data, are documented and ready at the standby facility to use at a moment's notice. It is simple for someone under pressure to run these scripts and activate the standby facility.
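The failover scripts themselves are not reproduced in this paper. As a rough, hypothetical illustration of the core step such a script performs with database mirroring, a forced failover on the mirror looks like the following; OurDB is a placeholder name, and forcing service is appropriate only when the principal data center is truly unavailable:

-- Run on the mirror (standby) instance when the principal is unreachable.
-- Forcing service can lose transactions that had not yet reached the mirror.
ALTER DATABASE OurDB SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS;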

Configuration Documents, Steps, and Diagrams
- Configuration documents exist for all equipment. These are easy to use, yet complete, and they help achieve consistency across the enterprise.
- These documents also make sure that infrequently changed equipment is always configured correctly. A good example is the configuration of fiber switches for SQL Server clusters.
- We maintain an internal knowledge base (KB) system, which is used extensively to share information and document core knowledge.

Policies
- We have a policy that no updates are applied past noon on Thursday. This helps prevent errors or problems from occurring over the weekend, when the response time of IT personnel may be longer.
- Extensive code reviews and testing processes ensure the accuracy of code before its deployment to the production environment.

Appendix D. Scripts
The cf_IsMirrorInstance() Function

CREATE FUNCTION [dbo].[cf_IsMirrorInstance] ()
RETURNS bit
AS
BEGIN
    -- This function determines whether a server is the mirror
    -- instance or the principal.
    -- Assumption: All databases reside at either one location
    -- or the other.
    -- Replace <db name> with an actual database name.
    DECLARE @mirroring_role_desc nvarchar(60)
    DECLARE @IsMirrorInstance bit

    -- Choose a single critical database and test to see whether
    -- it is PRINCIPAL or MIRROR.
    -- Because the databases are interrelated, all must be the same;
    -- other jobs test for this.
    SELECT @mirroring_role_desc = mirroring_role_desc
    FROM sys.database_mirroring m
    JOIN sys.databases d ON m.database_id = d.database_id
    WHERE d.name = '<db name>'

    -- Evaluate the result.
    IF (@mirroring_role_desc IS NULL)
        OR (@mirroring_role_desc = 'PRINCIPAL')
        SET @IsMirrorInstance = 0
    ELSE
        SET @IsMirrorInstance = 1

    -- Return the result.
    RETURN @IsMirrorInstance
END
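One way a function like this can be used is as a guard at the top of a SQL Server Agent job step, so the step exits quietly when the instance is currently the mirror; this snippet is illustrative rather than taken from the actual jobs:

-- Exit the job step without doing any work if this instance is currently the mirror.
IF dbo.cf_IsMirrorInstance() = 1
    RETURN;

-- ...normal job-step work continues here when this instance is the principal...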



Determining the Current State of a Mirror Database
ServiceU uses the following query as part of a larger SQL Server Agent job script that alerts IT personnel when database mirroring states have a value other than SYNCHRONIZED or SYNCHRONIZING. The query lists the actual database names, the database state, the mirroring session state, and the mirroring role of the database. When IT personnel receive the alerts, they can quickly see which databases are having problems and the current mirroring state of each.

SELECT d.name, m.mirroring_state_desc, d.state_desc, mirroring_role_desc
FROM sys.database_mirroring m
JOIN sys.databases d ON m.database_id = d.database_id
WHERE mirroring_guid IS NOT NULL
  AND mirroring_state_desc NOT IN ('SYNCHRONIZED', 'SYNCHRONIZING')
ORDER BY d.name ASC

Appendix E. Feedback
Did this paper help you? Please give us your feedback. Tell us, on a scale of 1 (poor) to 5 (excellent), how you would rate this paper and why you have given it this rating. For example:

- Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
- Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?

This feedback will help us improve the quality of white papers we release.

Send feedback.
