Role-Based High Availability with Exchange 2007

Jim McBee

http://www.ithicos.com

Who is Jim McBee!!??
• Consultant, Writer, MCSE, MVP and MCT – Honolulu, Hawaii
• Principal clients (Dell, Microsoft, SAIC, Servco Pacific)
• Author – Exchange 2003 Advanced Administration (Sybex)
• Contributor – Exchange and Outlook Administrator
• Blog
  – http://mostlyexchange.blogspot.com
  – http://www.directory-update.com

Agenda

• High availability versus fault tolerance

• Resiliency versus high availability

• Server roles

• Providing higher availability

• Continuous replication technologies

Fault tolerance

• Designing and building a server that is resistant to failure

• All servers should be fault tolerant
• RAID disks
• ECC memory
• Redundant power supplies
• UPS systems
• Active Directory and DNS

High availability

• Components of your system that allow quicker recovery from a failure

• Examples include…
  – Clustering
  – Load balanced hosts
  – Built-in redundancy or load balancing
  – DNS / application redundancy or load balancing

Resiliency

• Solutions that allow for contingency of operations

• Recovery in the event of a serious disaster
• Not solutions that are invoked when applying a service pack or during a quick power outage
• Usually not automatic failover
• Examples include…
  – Standby Continuous Replication
  – Local Continuous Replication

Server roles

Roles configured at installation
• Simplify installation

– Optimize the server for the jobs it performs

– Increase availability through the most efficient and economic means

– Manage the servers more intuitively

Exchange 2007 Server Roles
By defining well-described roles, we can:
  – Remove unnecessary functionality
    • Benefit: optimize server performance
  – Reduce the attack surface
    • Benefit: reduced exposure in the perimeter

[Diagram: Edge Transport server in the perimeter network; Hub Transport, Mailbox, Client Access, and Unified Messaging servers in the protected network]

Server Roles 1/5
• Edge Transport
  – Must be on its own separate physical machine
  – No other roles installed
  – May be workgroup member or joined to an Active Directory domain
  – Uses Active Directory Application Mode (ADAM) for configuration and recipient information
  – Perimeter policy enforcement
  – Message hygiene
    • Anti-spam
    • Transport anti-virus
• Not Required

Server Roles 2/5
• Client Access Server (CAS)

– Supports Outlook Web Access, Exchange ActiveSync, Outlook Anywhere (formerly RPC/HTTPS), POP3 and IMAP4 protocols, Autodiscover, Availability, and Web services

– At least one CAS in each Active Directory site and domain where mailbox servers exist

– Requires good network connection (low latency) to mailbox servers

– Uses RPC communication to mailbox server

Server Roles 3/5
• Hub Transport
  – Handles message delivery and routing (see EX03)
  – Applies policies to incoming and outgoing mail
  – Can handle message hygiene functions
  – Reduces cost and complexity
    • Provides more predictable routing
    • Reduces downtime

Server Roles 4/5
• Mailbox
  – Responsible for serving mailbox databases and public folders
  – Mailbox access through MAPI
  – Possible to require MAPI encryption
  – Possible to run without public folders

Server Roles 5/5
• Unified Messaging
  – Placed in the protected corporate network
  – Requires that Mailbox and Hub Transport roles exist
  – Check with your phone vendor to see if their phone system will work with UM server
    • May require PBX gateway

Things to Consider
• Interdependencies
  – Mailbox servers require the Hub Transport role for message delivery – even to the same database
  – The CAS role provides OWA, ActiveSync, RPC over HTTP, the Availability Service, Autodiscover, and more
  – The Edge role requires a Hub Transport server
• Fault tolerance
  – Mailbox servers can only talk to Hub Transport servers in the same Active Directory site
  – Mailbox servers will talk to Hubs on the same server before other Hubs in the same Active Directory site
  – For proxy & re-direct scenarios CAS connects to "best" CAS
    • CAS not the same as FE servers
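The Hub selection rule above (same Active Directory site only, with a Hub on the same server preferred) can be illustrated with a small Python model. This is only a sketch of the stated rule, not how Exchange implements it; the `HubTransport` class, the `candidate_hubs` function, and the server and site names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubTransport:
    name: str     # server name (hypothetical)
    ad_site: str  # Active Directory site the Hub belongs to


def candidate_hubs(mailbox_server: str, mailbox_site: str,
                   hubs: list[HubTransport]) -> list[HubTransport]:
    """Return the Hubs a Mailbox server may use, co-located Hub first."""
    # Only Hubs in the Mailbox server's own Active Directory site qualify.
    same_site = [h for h in hubs if h.ad_site == mailbox_site]
    # sorted() is stable and False sorts before True, so a Hub on the
    # same server moves to the front while the others keep their order.
    return sorted(same_site, key=lambda h: h.name != mailbox_server)
```

An empty result models the interdependency noted above: with no Hub Transport server in the site, the Mailbox server has nowhere to hand off mail.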

High availability

Focus on Availability and Resiliency

• Improve data availability and resiliency
  – Protect mailbox data from failures and corruptions
  – Reduce time required to restore mailbox data
  – Provide data redundancy
• Service availability
  – Make mailbox data more available
  – Make cluster failover less painful
  – Make cluster management easier
  – Support for ‘stretch’ or ‘geo-clusters’
  – Allow large mailboxes inexpensively

Hub Transport Server High Availability Options
• Use redundant hardware
• Automatically load balanced and redundant with multiple Hub Transport servers
• Inbound SMTP mail
  – Direct delivery to Hub Transport from Internet
  – Direct delivery to Hub Transport from 3rd party SMTP system
  – Load balancing
    • Third party load balancing
    • Windows Network Load Balancing (NLB)
• Server failure will result in failure of current connections
• May result in some data loss for any messages in the Hub Transport Server queue database

Client Access Server High Availability Options
• Redundant hardware
  – Windows NLB or third party load balancing
  – Round robin DNS (not the best solution)
• Server failure will result in current connections being lost
  – User may need to re-establish connection

Unified Messaging Server High Availability Options
• Redundant hardware
  – Windows NLB or third party load balancing
  – Round robin DNS
• PBX or Gateway redundancy
  – Some PBXs may have load balancing options for multiple UM servers
• Server failure will result in the loss of any current connections or call transfers in progress

Mailbox Server High Availability and Resiliency Options
• Resiliency and recoverability
  – Local continuous replication (LCR)
  – Standby continuous replication (SCR)
    • Requires Exchange 2007 SP1
• High availability
  – Cluster continuous replication (CCR)
  – Single copy clusters (SCC)
• CCR and SCC require dedicated servers
  – No other roles can exist on a clustered node except Mailbox
  – Other roles must be on their own hardware
• Changes to transaction log files
  – 1MB in size
  – Log file is completely written after 15 minutes
  – Checkpoint depth is still 20MB / Storage Group

Single Copy Clusters
• Requires Microsoft Cluster Services
• Benefits
  – Improved Exchange Cluster setup
  – Traditional clustering used today
  – Failovers use the same data copy
• Exchange Virtual Server = Clustered Mailbox Server
• 2 to 8 node Active / Passive clusters
[Diagram: Mailbox (MB) cluster nodes attached to shared storage holding the quorum (Q), database (DB), and logs]

SCC Caveats
• Requires expensive hardware with shared storage
• Can be complicated for admins to learn
• Doesn’t protect from storage/data issues
  – Data redundancy provided through partners
• Servers must be on same IP subnet
• Hardware must be in the Windows Server Catalog

Local Continuous Replication
• Additional copy of the logs and database
  – On the same server
  – On a different volume
• Benefits
  – Easy configuration
  – Single datacenter
  – Doesn’t require expensive hardware
  – Online backups
  – Very quick restoration of service
• Caveats
  – Adds additional CPU/memory/disk overhead
  – Initial seeding required
  – Manual activation
  – Additional storage requirements
  – One database per storage group

Local Continuous Replication
[Diagram: after LCR is enabled, logs (E00.log, E0000000011.log, E0000000012.log) in D:\SG1\Logs are copied and verified into D:\SG1\Copy\Logs, and the database copy is advanced by playing the logs]

Local Continuous Replication Tips
• One database per storage group
• Plan for additional hardware resources
  – Minimum 20% additional CPU overhead
  – Additional 1GB of RAM
  – Will more than double IOPS requirements

• Maximum database size approximately 200GB
• Separate storage into LUNs

  – Do not break LUNs into separate partitions
  – Put each database on a separate LUN
  – Isolate active and passive LUNs
• Use battery backed up storage controllers
  – Configure caching controllers for 75% write / 25% read
• LCR activation is manual
  – Use Restore-StorageGroupCopy cmdlet
  – Use backup copy “in place” or move it
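The hardware planning figures above lend themselves to a quick worked calculation. The following Python sketch applies them to a baseline measurement; the function name and the reading of "20% additional CPU overhead" as a 1.2x multiplier on current utilization are assumptions made for illustration, and because the tips say IOPS will "more than double", the IOPS figure is a floor rather than an estimate.

```python
def lcr_minimum_sizing(cpu_percent: float, ram_gb: float,
                       iops: float) -> dict[str, float]:
    """Lower-bound resource targets for a server after enabling LCR."""
    return {
        # Assumption: "20% additional CPU overhead" scales current load by 1.2.
        "cpu_percent": cpu_percent * 1.20,
        # One additional gigabyte of RAM.
        "ram_gb": ram_gb + 1,
        # IOPS "more than double", so treat 2x as the minimum to plan for.
        "iops": iops * 2,
    }
```

For example, a server measured at 50% CPU, 8GB of RAM, and 1,000 IOPS would need headroom for at least 60% CPU, 9GB of RAM, and 2,000 IOPS.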

Local continuous replication

Clustered Continuous Replication
• Benefits
  – Potentially no single point of failure
  – Two copies of the data on separate servers
  – No need for shared / SAN storage
  – Full redundancy with automatic recovery
  – Backup mailboxes without disturbing production
  – Doesn’t require validation for clustered configuration

[Diagram: two-node CCR cluster, each node with its own database and logs, using a file share witness (KB 921181)]

CCR Advantages
• No single point of failure
• Fast recovery
• Simplified hardware and storage requirements
• Simplified deployment
• Out-of-the-box replication solution
• Can “stretch” the cluster to a second data center
• Ability to offload VSS-based backups to passive node
• Can integrate with SCR

CCR Caveats
• Requires Microsoft Cluster Services
  – Majority Node Set cluster
  – Requires a third “voting” node - uses a shared folder
• Two-node, Active/Passive only
• Backup:
  – Streaming backup against production storage groups
  – VSS backup against production and replica storage groups
• Limit of one database per storage group
• Can be used for PF database if it is the only PF database in the organization
• Initial database seeding required
• Servers must be on same IP subnet
• Transaction logs pulled over SMB shares
• Some scenarios require log validation, replay
• Database failure does not cause failover
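The reason a two-node CCR cluster needs a third "voting" member can be shown with the majority rule itself. The sketch below is a generic majority check written in Python for illustration, not the cluster service's actual logic; the function name `has_quorum` is hypothetical.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Majority rule: strictly more than half of all votes must be reachable."""
    if not 0 <= votes_online <= total_votes:
        raise ValueError("votes_online must be between 0 and total_votes")
    return 2 * votes_online > total_votes
```

With two nodes plus the file share as the third vote, one surviving node that can still reach the share holds 2 of 3 votes and stays online. With only two votes in total, a single surviving node holds 1 of 2, which is not a majority.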

Standby Continuous Replication
• Coming in Service Pack 1
• Source and target machines can be
  – Stand-alone
  – In two different MSCS clusters
  – On different subnets
• Controlled per storage group
• Many-to-one and one-to-many supported
• Manually activated
[Diagram: replication of database and logs to a standby server]

LCR versus CCR versus SCR
• LCR
  – Focused towards resiliency
  – Improve restore time
  – Administrator has to initiate restore manually
  – Single data-center solution
  – Implements log shipping and replay out of the box
    • Log files are copied locally and replayed
• CCR
  – Targeted towards site resiliency
  – Automatic failovers
  – Single or two-data center solution
  – Supports “stretch” option
  – Implements log shipping and replay out of the box
    • Log files are copied to remote server and replayed
  – Simplifies cluster deployment
    • No SAN or shared storage
• SCR
  – Provides site and server resiliency
  – “Cold spare” approach cuts hardware costs
  – Can be combined with LCR, CCR, and SCC for maximum flexibility

Continuous Replication Basics
• Exchange store runs normally
• Replication service keeps a copy of the database up-to-date
  – Copies, inspects, and replays log files
• In CCR, Cluster service provides failover
  – Move network identity (client transparency)
• LCR activation is manual
  – Restore-StorageGroupCopy task
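The copy, inspect, and replay sequence described above can be modeled as three in-memory stages. This Python sketch is a simplified illustration only: the `ReplicaState` class, the `replicate_once` function, and the `is_valid` check are hypothetical stand-ins, and log files are represented by their generation numbers rather than real files.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReplicaState:
    inspector: list[int] = field(default_factory=list)    # copied, awaiting inspection
    target_logs: list[int] = field(default_factory=list)  # inspected, awaiting replay
    replayed: list[int] = field(default_factory=list)     # applied to the database copy


def replicate_once(source_logs: list[int], state: ReplicaState,
                   is_valid: Callable[[int], bool] = lambda gen: True) -> None:
    """One replication pass: copy new logs, inspect them, replay them."""
    known = set(state.inspector) | set(state.target_logs) | set(state.replayed)
    # Copy: pull every source generation the copy side has not seen yet.
    for gen in sorted(source_logs):
        if gen not in known:
            state.inspector.append(gen)
    # Inspect: logs that pass the check move on; stop at the first bad one
    # so that later generations are never replayed out of order.
    while state.inspector and is_valid(state.inspector[0]):
        state.target_logs.append(state.inspector.pop(0))
    # Replay: apply inspected logs to the database copy in order.
    while state.target_logs:
        state.replayed.append(state.target_logs.pop(0))
```

Holding a log that fails inspection in its own stage is the point of the separate inspector step: a damaged log never reaches the database copy.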

Continuous Replication Basics
• A ‘pull’ model
• Exchange server creates log files normally
• Log files are copied by Replication service
  – Exxnnnnnnnn.log files copied as they appear
• Exx.log is copied for handoff/failover
  – If it can’t be copied, loss setting (AutoDatabaseMountDial) is consulted
    • Lossless (0 logs lost)
    • GoodAvailability (3 logs lost)
    • BestAvailability (6 logs lost – default setting)
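The AutoDatabaseMountDial thresholds listed above amount to a simple comparison at failover time. The Python sketch below models that decision using the three values from the slide; the function and dictionary names are hypothetical, and real Exchange behavior involves more checks than this.

```python
# Maximum number of lost log files tolerated by each dial setting (from the slide).
MOUNT_DIAL_MAX_LOST_LOGS = {
    "Lossless": 0,
    "GoodAvailability": 3,
    "BestAvailability": 6,  # default setting
}


def can_auto_mount(logs_lost: int, dial: str = "BestAvailability") -> bool:
    """Return True if the database copy may mount automatically after failover."""
    if logs_lost < 0:
        raise ValueError("logs_lost cannot be negative")
    return logs_lost <= MOUNT_DIAL_MAX_LOST_LOGS[dial]
```

Because Exchange 2007 log files are 1MB each, the default setting accepts the loss of at most about 6MB of log data in exchange for an automatic mount.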

Continuous Replication
[Diagram: the Store writes to the source database and source log directory; the Replication service copies logs to the inspector directory, moves inspected logs to the target log directory, and replays them into the database copy. Progress is tracked by LastLogCopyNotified, LastLogCopied, LastLogInspected, and LastLogReplayed.]

Continuous Replication Monitoring

• LastLogCopyNotified
  – Last generation seen in the source directory
• LastLogCopied
  – Last generation copied to Inspector directory by Replication service
• LastLogInspected
  – Last generation inspected
  – Moved to log file directory
• LastLogReplayed
  – Last generation replayed into the database copy
• Available through Performance Monitor
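Since each counter above is a log generation number, the gaps between them show how far each replication stage is lagging. The Python sketch below derives those backlogs by subtraction; the function name and the queue labels are hypothetical names chosen for illustration, and the inputs are assumed to be integers read from the four counters.

```python
def replication_queues(last_log_copy_notified: int, last_log_copied: int,
                       last_log_inspected: int, last_log_replayed: int) -> dict[str, int]:
    """Derive per-stage backlogs from the four replication counters."""
    return {
        # Logs seen in the source directory but not yet copied.
        "copy_queue": last_log_copy_notified - last_log_copied,
        # Logs copied to the inspector directory but not yet inspected.
        "inspect_queue": last_log_copied - last_log_inspected,
        # Logs inspected but not yet replayed into the database copy.
        "replay_queue": last_log_inspected - last_log_replayed,
    }
```

A growing copy queue points at the network or source server, while a growing replay queue points at disk performance on the copy side.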

Divergence
• When the copy has information not in the original it is diverged
• Divergence may be in database or log files
• Lossy failover will produce a divergence
• ‘Split-brain’ on a cluster also causes divergence
  – Even if clients can’t connect, background maintenance still modifies the database
• Administrator error can cause divergence!
  – e.g. running eseutil /r

Recovering from Divergence
• Re-seed will always work
  – Expensive for large databases
• Look at the common case
  – Lossy failover
  – Only a few log files are lost
• Built-in solutions
  – Decreased log file size to reduce data loss
  – Lost Log Resilience (LLR)

Transport Dumpster
• Feature built into the Hub Transport server role
• Redelivers mail to the clustered mailbox servers (CMSs) in its site
• Uses the creation time of the last log file copied
• CCR only in RTM
• Use Set-TransportConfig to change default settings (setting is organization-wide)
  – Set MaxDumpsterSizePerStorageGroup to 1.5 times the size of the maximum message that can be sent (default value is 18MB)
  – Recommend MaxDumpsterTime be 7.00:00:00, which is seven days (default value)
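The transport dumpster sizing rule (1.5 times the maximum message size) and the 7.00:00:00 retention value can be checked with a short calculation. This Python sketch is illustrative only: both function names are hypothetical, and the parser handles just the days.hours:minutes:seconds form shown on the slide.

```python
from datetime import timedelta


def recommended_dumpster_size_mb(max_message_size_mb: float) -> float:
    """Apply the 1.5x rule to the largest message that can be sent."""
    return 1.5 * max_message_size_mb


def parse_timespan(value: str) -> timedelta:
    """Parse a 'd.hh:mm:ss' value such as 7.00:00:00 into a timedelta."""
    days, _, clock = value.partition(".")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)
```

With a 10MB maximum message size the rule gives 15MB per storage group, and parsing 7.00:00:00 confirms a seven-day retention window.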

Backups from Passive Database

• Backing up the passive moves the performance hit off the active
• Back up the active or the passive?
  – Remember, they can change designations
• Passive backup is VSS only
  – Data Protection Manager v2
• Active backup can be VSS or streaming ESE

Questions?

Thanks for attending!

Book giveaway and e-mail notice

• Please give me a piece of paper with your name for drawing

• Include your e-mail address or give me a business card if you want:
  – 20% discount code for Directory Update software
  – Notification e-mail when Mastering Exchange Server 2007 is available
