Automated Disaster Recovery With BMC Atrium Orchestrator

Applying the capabilities of IT Process Automation to help meet the daily challenges faced by Disaster Recovery / IT Service Continuity Professionals

BEST PRACTICES WHITE PAPER

CONTENTS

Introduction
The Challenges
Automation Drives Business Recovery
Modular Approach to Disaster Recovery Automation
Fast and Efficient Communication
The Daily ‘Disasters’
Loss of a Data Center – “Active-Active”
Loss of a Data Center – “Active-Passive”
Benefits and Showing Value
Cost of Downtime
Cost of Hardware
Cost of Staff
Risk of Disaster Recovery Plan Drift
Risk of Infrequent Testing
The Complete BMC Solution
Conclusions

INTRODUCTION

Events such as 9/11, Hurricane Katrina, and other recent, high-profile disasters have brought a harsh realism to the potential

devastation that can impact any of us unexpectedly, at a moment’s notice. And while the concern for human tragedy will always

remain the central one during such times, there is a growing understanding of — and concern for — the impact to businesses,

as well. Indeed, with the rise in importance technology plays toward delivery of core business services, planning and preparing

for disaster scenarios has taken on its own essential role to ensure continuity of business services.

Disruption of business services, whether the result of minor technology failures that occur daily or full-scale disasters, can

be highly detrimental to a company’s financials, as well as to its reputation. Once you acknowledge the value technology has

to your organization, you must also consider the related consequences when and if that technology becomes temporarily

unavailable or more severely impaired due to catastrophic failure.

Business continuity planning is used to identify needs, analyze consequences, and develop recovery strategies specifically

designed to ensure operational continuity at a minimal level or standard in the event of a disruption to the business. Such

events come in a variety of shapes and sizes, but as the old adage goes, they always tend to come at the most unexpected

and inopportune times. The full impact of such events on production services is, as a rule, directly dependent on the speed

and efficiency with which IT Operations can detect a disruption, triage the impact on services, and then execute the

recovery process.

THE CHALLENGES

The common reality across IT is that, even today, detection is left to existing piece-part monitoring tools, and the overall

Disaster Recovery Lifecycle is managed in an ad-hoc fashion, manually developed on the fly. This often results in “less than

optimal” recovery times, which have a direct impact on an organization’s bottom line.

In some cases, the disaster recovery process may involve automated scripts that are triggered manually, in conjunction with

step-by-step instructions codified in procedural documents. Even so, the execution of the disaster recovery processes is

highly manual, difficult to coordinate, and cumbersome to achieve.

Business continuity managers also face challenges around testing and maintaining the currency of their disaster recovery plans

and procedures. Although many business continuity managers now sit on change review boards, it is certainly not uncommon

that some changes slip through the net, thus rendering the disaster recovery plan out of date and ineffective. This disaster

recovery plan “drift” problem is exacerbated by the infrequency with which testing is done. Most organizations only test their

disaster recovery plans once a year (if at all!) and this is usually a very expensive operation requiring tens if not hundreds of

skilled IT staff working over nights and weekends.

AUTOMATION DRIVES BUSINESS RECOVERY

BMC Atrium Orchestrator is an enterprise-class automation platform that speeds and simplifies the process of developing

and maintaining business recovery scenarios and, most importantly, restoring service. BMC Atrium Orchestrator automates

any set of repeatable manual tasks and scripts as a workflow, ensuring speed, accuracy, and consistency of execution.

Through the BMC Atrium Orchestrator Development Studio, business recovery scenarios can be rendered, maintained, and

executed as repeatable workflows — all from a single location that is accessible globally. Whether recovery scenarios call for

multiple database restores, porting multiple applications to backup data center servers, reconfiguring SAN-based storage, or

simply notifying large numbers of people quickly, a BMC Atrium Orchestrator workflow can be built and executed to automate

the required tasks.

To provide a more complete solution, BMC Atrium Orchestrator also leverages the capabilities provided by other enterprise

management solutions. For example, event management solutions have become much more effective at identifying and

associating technical problems with impacts on business services. This maturity enables rapid identification of service-

impacting problems, which can then be picked up by the automation tool as a trigger to quickly drive the initiation of a

disaster recovery plan.

As an example, the BMC ProactiveNet Performance Management solution provides a real-time view of available business services

and the relative priority and importance of those services to the business. The underlying service models can accurately assess the

impact of any supporting technical component on the overall availability and performance of a business service. The additional

benefit of the solution’s service impact management functionality is that the encapsulated service models are generated from

information within the BMC Atrium CMDB, meaning that so long as the information in the CMDB is current (through proper change

processes), the service models are also automatically kept current, which helps alleviate some of the issues associated with

disaster recovery plan “drift”.

Figure 1: BMC ProactiveNet Performance Management – Business Service Impacted

BMC Atrium Orchestrator complements and easily links to other existing management systems, IT service management

solutions, and individual infrastructure devices enabling centralized control and visibility across an entire technology

infrastructure. BMC Atrium Orchestrator can execute recovery scenarios in a fully automated mode, semi-automated mode

(e.g. generate change requests/e-mails and gather authorizations before initiating a disaster recovery plan), or can be used

as a centralized interface from which an operator can run individual recovery scenarios as necessary.

MODULAR APPROACH TO DISASTER RECOVERY AUTOMATION

Fully automating all disaster recovery processes is not something that tends to happen overnight, so an incremental approach

is required. There are various levels of disaster recovery protection that provide building blocks to help form a more complete

end-to-end solution.

» Component Protection: Most hardware (compute, network, and storage) provides some form of high-availability/clustering technology, which is the first line of defense against failures.

» Application / Business Service Protection: Business services can be monitored and managed as a logical entity. Each business service can have its own set of recovery processes that can be initiated in isolation from any other business service.

» Site / Data Center Protection: In extreme circumstances, loss of a location can be a risk that organizations need to protect themselves against. In these cases, a much more complex and coordinated plan is required, where multiple business services are prioritized and recovered to a remote site.

A key factor when approaching disaster recovery is cost. As in many situations, there is a cost-vs-risk balance that needs to be

considered as part of any plan that is put together. What is the acceptable downtime for a component or business service and

what will the impact be on the business? How much are you willing to invest to keep downtime to a minimum?

Again, various tiers of protection can be implemented for each component or business service. For high-priority services,

an “active-active” approach may be acceptable. Although costly to implement, dedicated hardware and constant data

synchronization between remote sites can enable an extremely fast recovery process.

In other cases, dedicated hardware is too expensive, so an “active-passive” configuration may be more appropriate. This is

where a plan is put in place to utilize test or development platforms in the event of a disaster and reconfigure these systems

to manage the production environments. Typically, the time taken to recover “active-passive” configurations is longer, and the

steps taken to implement failover are more complex and risky.

There are also considerations about actual ownership of disaster recovery hardware. These days, organizations may choose

to have their own dedicated secondary data center or they may rent space from a specialist disaster recovery service provider.

With the advent of cloud computing, there are further options, which enable organizations to build their recovery plans utilizing

hosted resources in a public cloud.

FAST AND EFFICIENT COMMUNICATION

Regardless of the nature of a disaster, there is always a need to communicate quickly and effectively to all employees who

may be impacted — whether critical IT personnel needed to restore and verify services, users impacted by outages, or staff

required to report to an alternate site. In each of these situations, BMC Atrium Orchestrator can be the single point of control

and execution of communications. Workflows that interact with voice systems can also be executed to establish bridges for

announcements, call out to critical resources as a part of recovery to notify responsible IT personnel, or deliver instructions

to employees based on the nature of the disaster.
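The idea of a single point of control for communications can be illustrated with a small sketch. It is not BMC Atrium Orchestrator code; the `Contact` type, the three audience tags, and the message templates are all hypothetical names chosen for illustration. The sketch only builds the messages, leaving delivery to whatever voice, e-mail, or SMS system is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    role: str      # hypothetical audience tag: "recovery", "user", or "relocating"
    channel: str   # hypothetical delivery channel: "voice", "email", or "sms"


# Hypothetical message templates, one per audience named in the text above.
TEMPLATES = {
    "recovery": "Join the recovery bridge for {service} now.",
    "user": "{service} is unavailable; updates will follow.",
    "relocating": "Report to the alternate site for {service}.",
}


def build_notifications(contacts, service):
    """Return (channel, name, message) tuples, one per contact with a known role."""
    notifications = []
    for contact in contacts:
        template = TEMPLATES.get(contact.role)
        if template is None:
            continue  # unknown audience: skip rather than guess a message
        notifications.append(
            (contact.channel, contact.name, template.format(service=service))
        )
    return notifications
```

Keeping message construction separate from delivery means the same list can be handed to several delivery systems without changing who is told what.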

THE DAILY ‘DISASTERS’

While we may not think of the outages that occur on a daily basis as disasters, they are certainly events that disrupt business

services; and, as discussed, the procedures for restoring service from these daily outages should be leveraged for larger events

that can, in many cases, truly be classified as disasters. Loss of a database environment is an event that can occur at any time

— as well as during a major disaster. While applications and networks may be working fine, without the database, the business

service is disrupted. BMC Atrium Orchestrator workflows can be written to execute database fallback or restore scenarios in

support of any or all environments. In the event of a loss of a database or larger event that affects multiple databases,

BMC Atrium Orchestrator workflows can be accessed and executed from any location to restore database services.

BMC Atrium Orchestrator workflows can be designed to execute in a fully automated fashion or interactively. In this scenario,

the BMC Atrium Orchestrator recovery workflow would intercept an event, such as an SNMP trap or similar notification from

an event management system, and based on the event type, automatically execute steps to:

a. Verify that a problem exists and that the event was not part of a planned outage
b. Determine what other resources may be affected and require attention
c. Document the current state by opening a service desk incident or generating and distributing a report
d. Provision hardware and software resources to replicate the operating environment
e. Recover the data to the status quo ante
f. Restore associated resources
g. Close the incident or update the report
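The steps above (a through g) can be sketched as a single function that reacts to one incoming event. This is a minimal illustration in Python, not an Atrium Orchestrator workflow: the `Event` type, the `dependents` mapping, and the log strings are assumptions made for the example, and each step only records what a real workflow would do.

```python
from dataclasses import dataclass


@dataclass
class Event:
    resource: str                 # hypothetical name of the failed resource
    kind: str                     # hypothetical event type, e.g. "database_down"
    planned_outage: bool = False  # True if the event falls in a planned window


def recover(event, dependents):
    """Walk steps a-g for one event and return a log of the steps taken.

    `dependents` maps a resource name to the resources that rely on it.
    """
    log = []
    # a. A planned outage is not a failure, so nothing further runs.
    if event.planned_outage:
        log.append("a: planned outage, no action")
        return log
    log.append("a: problem verified")
    # b. Work out what else is affected by the failed resource.
    affected = dependents.get(event.resource, [])
    log.append(f"b: affected resources {affected}")
    # c-e. Document, provision, and recover (recorded only in this sketch).
    log.append("c: incident opened")
    log.append("d: replacement environment provisioned")
    log.append("e: data recovered")
    # f. Restore each dependent resource in turn.
    for name in affected:
        log.append(f"f: restored {name}")
    # g. Close out the record opened in step c.
    log.append("g: incident closed")
    return log
```

Checking for a planned outage first keeps the workflow from opening incidents or provisioning hardware for events that were expected.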

Figure 2: Example Disaster Recovery Workflow

While this example depicts a fully automated recovery process, it would be just as easy to insert pauses in the workflow to report

progress-to-date and request operator confirmation of next steps to perform. You can see how this isolated event-and-recovery

process can be incorporated into a larger process to recover from an outage that affects an entire end-to-end business service.
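One way to picture the pauses described above is a runner that stops before selected steps and asks an operator whether to continue. The sketch below is illustrative only; `run_with_checkpoints`, the `confirm` callback, and the step names in the usage note are hypothetical and do not correspond to any product API.

```python
def run_with_checkpoints(steps, confirm, checkpoints):
    """Run steps in order, asking for confirmation before each checkpoint step.

    `steps` is a list of (name, callable) pairs. `confirm(name, done)` receives the
    upcoming step name and the list of completed step names, and returns True to
    continue. Returns (completed_names, halted_at); halted_at is None if every
    step ran.
    """
    done = []
    for name, action in steps:
        # Report progress to date and wait for a decision before risky steps.
        if name in checkpoints and not confirm(name, list(done)):
            return done, name
        action()
        done.append(name)
    return done, None
```

With an empty `checkpoints` set the same runner behaves as a fully automated workflow; listing a step such as a hypothetical "restore_data" in `checkpoints` turns it into a semi-automated one without changing the steps themselves.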

LOSS OF A DATA CENTER – “ACTIVE-ACTIVE”

Many IT organizations employ a dual data center strategy where business services are running live in two hot data centers.

Both data centers are set up to run all critical business services, and at any point in time, services are running live in one of the

two data centers. Various types of events can result in what is operationally the loss of a data center. Events, such as power

loss, building destruction, or a disaster that impacts telecommunications services, result in business services in the data center

becoming unavailable. The first step in this situation is to determine what was lost and which databases and applications were

running in the ‘failed’ site. Once that has been determined, procedures can be executed to restore business services in the

operational data center. BMC Atrium Orchestrator workflows can be written and executed to identify those services that were

running in the failed site and provide fast guidance as to what requires recovery in the operational site. Once determined,

BMC Atrium Orchestrator workflows can be used to execute the appropriate recovery scenarios to quickly restore service.
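The first step described above, working out what was running in the failed site, amounts to a lookup against a record of where each service is live. The sketch below assumes such a record exists as a simple mapping; the function name, the `placements` and `priority` structures, and the site names in the usage note are hypothetical. The priority ordering is an added assumption, reflecting the earlier point that services are prioritized during a site recovery.

```python
def services_to_recover(placements, failed_site, priority):
    """Return the services that were live in `failed_site`, most important first.

    `placements` maps a service name to the site where it is currently live.
    `priority` maps a service name to a rank (lower is more important);
    services without a rank are placed last, in alphabetical order.
    """
    lost = [svc for svc, site in placements.items() if site == failed_site]
    return sorted(lost, key=lambda svc: (priority.get(svc, float("inf")), svc))
```

For example, with "trading" and "email" live in a hypothetical "site-a" and ranked 1 and 2, the function returns them in that order, followed by any unranked services that were also live there.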

LOSS OF A DATA CENTER – “ACTIVE-PASSIVE”

In some environments, running a secondary hot data center is not practical. These IT organizations typically employ a cold

or warm backup site that contains the IT infrastructure components to recover critical business services, but does not keep

database and application infrastructure environments running. In this case, BMC Atrium Orchestrator workflows can be written

for each specific business service to execute the tasks that load and bring up application environments, applications, and

databases in the backup data center.

BENEFITS AND SHOWING VALUE

Business continuity managers will often want, or be required, to show the value that automation provides. Executives will

want to see the returns on any investment made in automation technology or understand the level to which they have mitigated risk.

Typically, the three big indicators around cost for a disaster recovery plan are:

» Cost of downtime of a service to the business (can be both a financial cost and impact to business reputation)
» Cost of hardware / real estate to implement plans
» Cost of staff to test or, if necessary, implement disaster recovery plans

There are also risk factors to consider:

» Currency of disaster recovery plan and procedures
» Frequency with which they can be tested

Automation can help in all of these areas.

COST OF DOWNTIME

The biggest measurable benefit of automation is likely to be around the time taken to recover a service. If the loss of a critical

business service can cost a business $500k an hour, reducing the recovery window from 4 hours to 30 minutes is a very

compelling story.
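The arithmetic behind that example is simple enough to work through. Using only the figures quoted above ($500k per hour, a 4-hour window reduced to 30 minutes) and assuming cost accrues linearly with time, the saving comes to $1.75 million per incident. The helper name `downtime_cost` is illustrative.

```python
def downtime_cost(cost_per_hour, hours):
    """Cost of an outage, assuming cost accrues linearly with time."""
    return cost_per_hour * hours


manual = downtime_cost(500_000, 4.0)      # 4-hour manual recovery window
automated = downtime_cost(500_000, 0.5)   # 30-minute automated recovery window
savings = manual - automated

print(f"manual ${manual:,.0f}, automated ${automated:,.0f}, saved ${savings:,.0f}")
# prints: manual $2,000,000, automated $250,000, saved $1,750,000
```

The linear model is a simplification; reputational impact, mentioned earlier as part of the cost of downtime, is not captured by an hourly rate.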

COST OF HARDWARE

With the advent of virtualization technologies, there is much greater flexibility in the use (or not) of dedicated physical hardware.

Virtual images can now easily be copied and migrated between physical hypervisor hosts and rebooted and reconfigured on

the fly using automation technology. Not only does this speed the recovery time in an “active-passive” situation, but existing

hardware running non-production virtual images can quickly be re-purposed to host the production environment to quickly

restore service.

COST OF STAFF

In non-automated environments, huge swathes of expensive, experienced IT engineers are required to properly test or

implement a disaster recovery plan. In larger environments, testing can involve hundreds of staff, working over a weekend.

Without automation, a “disaster recovery playbook” that describes the recovery procedures is walked through step by step

by many different IT teams (Network, UNIX®, Oracle® DB, etc.). After each step, the IT team with responsibility for the next step in

the playbook needs to be contacted. Then, that team needs to notify the next team that their steps have been completed, and so

forth. Automation would manage both the communication and orchestration of a recovery plan, vastly reducing the number of

people required to either test or initiate the disaster recovery plan.

RISK OF DISASTER RECOVERY PLAN DRIFT

A common problem, which is often exacerbated by the infrequency of testing, is that disaster recovery plans quickly become

out of date. Perhaps part of a service is moved to a new server or additional load balancers are added to make the service more

resilient. In either case, you will get one of two things happening.

a. If a service does go down and the disaster recovery plan is initiated, at some point during the recovery process, something isn’t going to work, which will add time to the recovery window and impact the business whilst the error is tracked down.

b. You will get false notifications of disasters when really the business services are functioning just fine. In most cases, you’d expect confirmation of a disaster before any plan would be initiated, so these kinds of issues should cause limited exposure. Still, it is an unwanted distraction.

The solution in both of these cases is tight change and configuration controls and, again, automation can play its part in ensuring

these processes are always executed as part of any infrastructure updates. BMC Atrium Orchestrator has specific runbooks

that integrate with server, network, and database configuration tools and automatically generate change tasks for any

updates made, thus keeping an accurate audit record of change and also keeping the CMDB up to date. In the BMC ProactiveNet

Performance Management example, this would in turn maintain the business service models and impacts of technology faults

on supporting configuration items ensuring that the monitoring / alerting mechanism is also current and accurate.
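Plan drift of the kind described in this section (a service moved to a new server, a load balancer added) can be detected by comparing what the recovery plan expects with what the configuration record shows. The sketch below assumes both can be reduced to a mapping from configuration item name to host; that representation, the function name, and the item names in the usage note are hypothetical and do not reflect the BMC Atrium CMDB data model.

```python
def find_drift(plan_items, cmdb_items):
    """Compare the recovery plan's view of a service with the configuration record.

    Both arguments map a configuration item name to its host. Returns three
    sorted lists: items missing from the plan, items the plan lists that no
    longer exist, and items whose host differs between the two.
    """
    plan_names = set(plan_items)
    cmdb_names = set(cmdb_items)
    missing_from_plan = sorted(cmdb_names - plan_names)
    stale_in_plan = sorted(plan_names - cmdb_names)
    moved = sorted(
        name for name in plan_names & cmdb_names
        if plan_items[name] != cmdb_items[name]
    )
    return missing_from_plan, stale_in_plan, moved
```

If the plan lists a database on one server and the configuration record shows it on another, the database appears in the third list; a newly added load balancer that the plan does not mention appears in the first. Running such a comparison after every change, rather than at the annual test, is one way to catch drift early.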

RISK OF INFREQUENT TESTING

As in the section above, infrequent testing of a disaster recovery plan results in inaccuracies in the plan. Whereas before

automation a disaster recovery plan could take tens or even hundreds of staff many hours to test, automation could test the plan

in a fraction of the time using far fewer people. The combination of fewer people and much faster testing times means that plans

can be tested on a much more frequent basis, greatly reducing the risk of out-of-date plans.

THE COMPLETE BMC SOLUTION

[Figure: the complete BMC solution. Components shown: Business Service User; Trading Service at Site A; Trading Service at Site B / Cloud; Atrium Orchestrator; CMDB / CMS; Service Model. Numbered callouts 1 to 6 correspond to the steps below.]

1. Service Model generated from CMDB
2. Real time monitoring of service model through BMC ProactiveNet Performance Manager
3. Service Impacting event causes service outage / failure
4. Service Impact alert picked up by BMC Atrium Orchestrator
5. Atrium Orchestrator initiates associated DR workflow
6. Service is automatically recovered at secondary site and service resumed

Authorizations: Decision makers confirm disaster (Example: Change requests generated in Remedy ITSM – wait on approvals)

Failover Process (e.g.):

» Shutdown what's left of production environment
» Re-allocate resources at DR site
» Data synchronization
» Restart service at DR site
» Re-direct users

CONCLUSIONS

The disaster recovery and business continuity processes in place at most companies typically consist of written procedures

augmented by traditional systems management tools for recovering IT resources. This fragmented approach extends recovery

time and hinders continuous process improvement initiatives. BMC Atrium Orchestrator provides a single point of visibility and

control for executing business recovery in the event of minor ‘daily disasters’ or major events that disrupt business services.

It provides immediate value by allowing you to incrementally build out your recovery processes, automating key recovery

processes first, until you have a fully integrated end-to-end process. And by implementing automation, you get a reliable,

repeatable process that will serve as the foundation for continuous process improvement.

BMC, BMC Software, and the BMC Software logo are the exclusive properties of BMC Software, Inc., are registered with the U.S. Patent and Trademark Office, and may be registered or pending registration in other countries. All other BMC trademarks, service marks, and logos may be registered or pending registration in the U.S. or in other countries. Oracle is a registered trademark of Oracle Corporation. UNIX is the registered trademark of The Open Group in the US and other countries. All other trademarks or registered trademarks are the property of their respective owners. © 2011 BMC Software, Inc. All rights reserved.

Business Runs on IT. IT Runs on BMC Software.

Business thrives when IT runs smarter, faster and stronger. That’s why the most demanding IT organizations in the world rely on BMC

Software across distributed, mainframe, virtual and cloud environments. Recognized as the leader in Business Service Management,

BMC offers a comprehensive approach and unified platform that helps IT organizations cut cost, reduce risk and drive business profit.

For the four fiscal quarters ended December 31, 2010, BMC revenue was approximately $2 billion. Visit www.bmc.com for more information.
