Applying the capabilities of IT Process Automation to help meet the daily
challenges faced by Disaster Recovery / IT Service Continuity Professionals
BEST PRACTICES WHITE PAPER
Automated Disaster Recovery With BMC Atrium Orchestrator
CONTENTS
Introduction
The Challenges
Automation Drives Business Recovery
Modular Approach to Disaster Recovery Automation
Fast and Efficient Communication
The Daily ‘Disasters’
Loss of a Data Center – “Active-Active”
Loss of a Data Center – “Active-Passive”
Benefits and Showing Value
Cost of Downtime
Cost of Hardware
Cost of Staff
Risk of Disaster Recovery Plan Drift
Risk of Infrequent Testing
The Complete BMC Solution
Conclusions
INTRODUCTION
Events such as 9/11, Hurricane Katrina, and other recent, high-profile disasters have brought a harsh realism to the potential
devastation that can impact any of us unexpectedly, at a moment’s notice. And while the concern for human tragedy will always
remain the central one during such times, there is a growing understanding of — and concern for — the impact to businesses,
as well. Indeed, as technology plays an increasingly important role in delivering core business services, planning and preparing
for disaster scenarios has become essential to ensuring the continuity of those services.
Disruption of business services, whether the result of minor technology failures that occur daily or full-scale disasters, can
be highly detrimental to a company’s financials, as well as to its reputation. Once you acknowledge the value technology has
to your organization, you must also consider the related consequences when and if that technology becomes temporarily
unavailable or more severely impaired due to catastrophic failure.
Business continuity planning is used to identify needs, analyze consequences, and develop recovery strategies specifically
designed to ensure operational continuity at a minimal level or standard in the event of a disruption to the business. Such
events come in a variety of shapes and sizes, but as the old adage goes, they always tend to come at the most unexpected
and inopportune times. The full impact of such events on production services is, as a rule, directly dependent on the speed
and efficiency with which IT Operations can detect a disruption, triage the impact on services, and then execute the
recovery process.
THE CHALLENGES
The common reality across IT is that, even today, detection is left to existing piece-part monitoring tools, and the overall
Disaster Recovery Lifecycle is managed in an ad-hoc fashion, manually developed on the fly. This often results in “less than
optimal” recovery times, which have a direct impact on an organization’s bottom line.
In some cases, the disaster recovery process may involve automated scripts that are triggered manually, in conjunction with
step-by-step instructions codified in procedural documents. Even so, the execution of the disaster recovery process is
highly manual, difficult to coordinate, and cumbersome.
Business continuity managers also face challenges around testing and maintaining the currency of their disaster recovery plans
and procedures. Although many business continuity managers now sit on change review boards, it is certainly not uncommon
that some changes slip through the net, thus rendering the disaster recovery plan out of date and ineffective. This disaster
recovery plan “drift” problem is exacerbated by the infrequency with which testing is done. Most organizations only test their
disaster recovery plans once a year (if at all!) and this is usually a very expensive operation requiring tens if not hundreds of
skilled IT staff working over nights and weekends.
AUTOMATION DRIVES BUSINESS RECOVERY
BMC Atrium Orchestrator is an enterprise-class automation platform that speeds and simplifies the process of developing
and maintaining business recovery scenarios and, most importantly, restoring service. BMC Atrium Orchestrator automates
any set of repeatable manual tasks and scripts as a workflow, ensuring speed, accuracy, and consistency of execution.
Through the BMC Atrium Orchestrator Development Studio, business recovery scenarios can be rendered, maintained, and
executed as repeatable workflows — all from a single location that is accessible globally. Whether recovery scenarios call for
multiple database restores, porting multiple applications to backup data center servers, reconfiguring SAN-based storage, or
simply notifying large numbers of people quickly, a BMC Atrium Orchestrator workflow can be built and executed to automate
the required tasks.
To provide a more complete solution, BMC Atrium Orchestrator also leverages the capabilities provided by other enterprise
management solutions. For example, event management solutions have become much more effective at identifying and
associating technical problems with impacts on business services. This maturity enables rapid identification of service-
impacting problems, which can then be picked up by the automation tool as a trigger to quickly drive the initiation of a
disaster recovery plan.
As an example, the BMC ProactiveNet Performance Management solution provides a real-time view of available business services
and the relative priority and importance of those services to the business. The underlying service models can accurately assess the
impact of any supporting technical component on the overall availability and performance of a business service. The additional
benefit of the solution’s service impact management functionality is that the encapsulated service models are generated from
information within the BMC Atrium CMDB, meaning that so long as the information in the CMDB is current (through proper change
processes), the service models are also automatically kept current, which helps alleviate some of the issues associated with
disaster recovery plan “drift”.
Figure 1: BMC ProactiveNet Performance Management – Business Service Impacted
BMC Atrium Orchestrator complements and easily links to other existing management systems, IT service management
solutions, and individual infrastructure devices enabling centralized control and visibility across an entire technology
infrastructure. BMC Atrium Orchestrator can execute recovery scenarios in a fully automated mode, semi-automated mode
(e.g. generate change requests/e-mails and gather authorizations before initiating a disaster recovery plan), or can be used
as a centralized interface from which an operator can run individual recovery scenarios as necessary.
MODULAR APPROACH TO DISASTER RECOVERY AUTOMATION
Fully automating all disaster recovery processes is not something that tends to happen overnight, so an incremental approach
is required. There are various levels of disaster recovery protection that provide building blocks to help form a more complete
end-to-end solution.
» Component Protection: Most hardware (compute, network, and storage) provides some form of high-availability/clustering technology, which is the first line of defense against failures.
» Application / Business Service Protection: Business services can be monitored and managed as a logical entity. Each business service can have its own set of recovery processes that can be initiated in isolation from any other business service.
» Site / Data Center Protection: In extreme circumstances, loss of a location can be a risk which organizations need to protect themselves against. In these cases, a much more complex and coordinated plan is required where multiple business services are prioritized and recovered to a remote site.
A key factor when approaching disaster recovery is cost. As in many situations, there is a cost-vs-risk balance that needs to be
considered as part of any plan that is put together. What is the acceptable downtime for a component or business service and
what will the impact be to the business? How much are you willing to invest to keep downtime to a minimum?
Again, various tiers of protection can be implemented for each component or business service. For high-priority services,
an “active-active” approach may be acceptable. Although costly to implement, dedicated hardware and constant data
synchronization between remote sites can enable an extremely fast recovery process.
In other cases, dedicated hardware is too expensive, so an “active-passive” configuration may be more appropriate. This is
where a plan is put in place to utilize test or development platforms in the event of a disaster and reconfigure these systems
to manage the production environments. Typically, the time taken to recover “active-passive” configurations is longer, and the
steps taken to implement failover are more complex and risky.
There are also considerations about actual ownership of disaster recovery hardware. These days, organizations may choose
to have their own dedicated secondary data center or they may rent space from a specialist disaster recovery service provider.
With the advent of cloud computing, there are further options, which enable organizations to build their recovery plans utilizing
hosted resources in a public cloud.
FAST AND EFFICIENT COMMUNICATION
Regardless of the nature of a disaster, there is always a need to communicate quickly and effectively to all employees who
may be impacted — whether critical IT personnel needed to restore and verify services, users impacted by outages, or staff
required to report to an alternate site. In each of these situations, BMC Atrium Orchestrator can be the single point of control
and execution of communications. Workflows that interact with voice systems can also be executed to establish bridges for
announcements, call out to critical resources as a part of recovery to notify responsible IT personnel, or deliver instructions
to employees based on the nature of the disaster.
THE DAILY ‘DISASTERS’
While we may not think of the outages that occur on a daily basis as disasters, they are certainly events that disrupt business
services; and, as discussed, the procedures for restoring service from these daily outages should be leveraged for larger events
that can, in many cases, truly be classified as disasters. Loss of a database environment is an event that can occur at any time
— as well as during a major disaster. While applications and networks may be working fine, without the database, the business
service is disrupted. BMC Atrium Orchestrator workflows can be written to execute database fallback or restore scenarios in
support of any or all environments. In the event of a loss of a database or larger event that affects multiple databases,
BMC Atrium Orchestrator workflows can be accessed and executed from any location to restore database services.
BMC Atrium Orchestrator workflows can be designed to execute in a fully automated fashion or interactively. In this scenario,
the BMC Atrium Orchestrator recovery workflow would intercept an event, such as an SNMP trap or similar notification from
an event management system, and based on the event type, automatically execute steps to:
a. Verify that a problem exists and that the event was not part of a planned outage
b. Determine what other resources may be affected and require attention
c. Document the current state by opening a service desk incident or generating and distributing a report
d. Provision hardware and software resources to replicate the operating environment
e. Recover the data to the status quo ante
f. Restore associated resources
g. Close the incident or update the report
Figure 2: Example Disaster Recovery Workflow
While this example depicts a fully automated recovery process, it would be just as easy to insert pauses in the workflow to report
progress-to-date and request operator confirmation of next steps to perform. You can see how this isolated event-and-recovery
process can be incorporated into a larger process to recover from an outage that affects an entire end-to-end business service.
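The seven recovery steps and the optional operator confirmation described above can be sketched in a few lines of Python. This is an illustration only, not BMC Atrium Orchestrator workflow syntax: the `RecoveryRun` class, the step labels, and the `confirm` callback are all hypothetical names introduced here, and each step is a placeholder where a real workflow would call service desk, provisioning, and backup tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RecoveryRun:
    """State for one recovery run. All names here are hypothetical."""
    event_type: str
    resource: str
    planned_outage: bool = False
    completed: List[str] = field(default_factory=list)


# Steps b-g from the list above, as placeholder labels only.
STEPS_AFTER_VERIFY = [
    "b. determine affected resources",
    "c. document current state (open incident)",
    "d. provision replacement resources",
    "e. recover data",
    "f. restore associated resources",
    "g. close incident",
]


def run_recovery(run: RecoveryRun,
                 confirm: Optional[Callable[[str], bool]] = None) -> bool:
    """Run steps a-g in order; return True if every step completed.

    If `confirm` is given, it is asked before each step after verification,
    which models the interactive (operator-confirmed) mode.
    """
    # Step a: ignore events that belong to a planned outage.
    if run.planned_outage:
        return False
    run.completed.append("a. verify problem")

    for step in STEPS_AFTER_VERIFY:
        if confirm is not None and not confirm(step):
            return False  # operator declined; stop before this step
        run.completed.append(step)
    return True
```

Calling `run_recovery(run)` with no `confirm` argument corresponds to the fully automated mode; passing a callback that prompts an operator corresponds to inserting pauses between steps.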
LOSS OF A DATA CENTER – “ACTIVE-ACTIVE”
Many IT organizations employ a dual data center strategy where business services are running live in two hot data centers.
Both data centers are set up to run all critical business services, and at any point in time, services are running live in one of the
two data centers. Various types of events can result in what is operationally the loss of a data center. Events, such as power
loss, building destruction, or a disaster that impacts telecommunications services, result in business services in the data center
becoming unavailable. The first step in this situation is to determine what was lost and which databases and applications were
running in the ‘failed’ site. Once that has been determined, procedures can be executed to restore business services in the
operational data center. BMC Atrium Orchestrator workflows can be written and executed to identify those services that were
running in the failed site and provide fast guidance as to what requires recovery in the operational site. Once determined,
BMC Atrium Orchestrator workflows can be used to execute the appropriate recovery scenarios to quickly restore service.
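As a rough illustration of the first step, identifying what was running in the failed site, the sketch below filters a service inventory by site and orders the result by priority. The inventory layout, the site names, and the `services_to_recover` function are assumptions made for this example; in practice this information would come from a service model or CMDB rather than a hard-coded dictionary.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: service name -> (site currently running it, priority).
# A lower priority number means the service should be recovered first.
Inventory = Dict[str, Tuple[str, int]]


def services_to_recover(inventory: Inventory, failed_site: str) -> List[str]:
    """Return the services that were live in the failed site, highest priority first."""
    affected = [
        (priority, name)
        for name, (site, priority) in inventory.items()
        if site == failed_site
    ]
    return [name for _priority, name in sorted(affected)]


inventory: Inventory = {
    "trading": ("site-a", 1),
    "email": ("site-b", 2),
    "reporting": ("site-a", 3),
}
to_recover = services_to_recover(inventory, "site-a")
```

The resulting list is what a recovery workflow would iterate over when restoring services in the operational data center.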
LOSS OF A DATA CENTER – “ACTIVE-PASSIVE”
In some environments, running a secondary hot data center is not practical. These IT organizations typically employ a cold
or warm backup site that contains the IT infrastructure components to recover critical business services, but does not keep
database and application infrastructure environments running. In this case, BMC Atrium Orchestrator workflows can be written
for each specific business service to execute the tasks that load and bring up application environments, applications, and
databases in the backup data center.
BENEFITS AND SHOWING VALUE
Business continuity managers will often want, or be required, to show the value that automation provides. Executives will
want to see the returns on any investment made in automation technology or understand the level to which they have mitigated risk.
Typically, the three big cost indicators for a disaster recovery plan are:
» Cost of downtime of a service to the business (can be both a financial cost and an impact on business reputation)
» Cost of hardware / real estate to implement plans
» Cost of staff to test or, if necessary, implement disaster recovery plans
There are also risk factors to consider:
» Currency of disaster recovery plans and procedures
» Frequency with which they can be tested
Automation can help in all of these areas.
COST OF DOWNTIME
The biggest, measurable benefit of automation is likely to be around the time taken to recover a service. If the loss of a critical
business service can cost a business $500k an hour, reducing the recovery window from 4 hours to 30 minutes is a very
compelling story.
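The arithmetic behind that example is simple enough to spell out. The short sketch below uses only the figures quoted above ($500k per hour, 4 hours versus 30 minutes); the `downtime_cost` function name is introduced here for illustration, and the linear cost model is an assumption, since real outage costs rarely scale exactly with time.

```python
def downtime_cost(cost_per_hour: float, hours: float) -> float:
    """Cost of an outage under a simple linear model (an assumption)."""
    return cost_per_hour * hours


COST_PER_HOUR = 500_000  # figure quoted in the text above

manual_cost = downtime_cost(COST_PER_HOUR, 4.0)     # 4-hour recovery window
automated_cost = downtime_cost(COST_PER_HOUR, 0.5)  # 30-minute recovery window
savings = manual_cost - automated_cost
```

Under this model the 4-hour outage costs $2,000,000, the 30-minute outage costs $250,000, and the difference is $1,750,000 per incident.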
COST OF HARDWARE
With the advent of virtualization technologies, there is much greater flexibility in the use (or not) of dedicated physical hardware.
Virtual images can now easily be copied and migrated between physical hypervisor hosts and rebooted and reconfigured on
the fly using automation technology. Not only does this speed the recovery time in an “active-passive” situation, but existing
hardware running non-production virtual images can quickly be re-purposed to host the production environment to quickly
restore service.
COST OF STAFF
In non-automated environments, huge swathes of expensive, experienced IT engineers are required to properly test or
implement a disaster recovery plan. In larger environments, testing can involve hundreds of staff, working over a weekend.
Without automation, a “disaster recovery playbook” that describes the recovery procedures is walked through step by step
by many different IT teams (Network, UNIX®, Oracle® DB, etc.). After each step, the IT team with responsibility for the next step in
the playbook needs to be contacted. Then, that team needs to notify the next team that their steps have been completed, and so
forth. Automation would manage both the communication and orchestration of a recovery plan, vastly reducing the number of
people required to either test or initiate the disaster recovery plan.
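To make the hand-off problem concrete, the sketch below shows an orchestrator walking a playbook in order and sending the notifications that teams would otherwise have to pass to each other by hand. It is a minimal illustration, assuming each step can be represented as a callable that reports success; `PlaybookStep`, `run_playbook`, and the `notify` callback are hypothetical names, not part of any BMC product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlaybookStep:
    team: str                   # owning team, e.g. "Network"
    task: str                   # human-readable description of the step
    action: Callable[[], bool]  # returns True on success (placeholder)


def run_playbook(steps: List[PlaybookStep],
                 notify: Callable[[str, str], None]) -> bool:
    """Run steps in order, notifying the owning team; stop at the first failure."""
    for step in steps:
        notify(step.team, f"starting: {step.task}")
        if not step.action():
            notify(step.team, f"FAILED: {step.task}")
            return False
    notify("all", "playbook complete")
    return True
```

Because the orchestrator owns the sequence, no team has to track down the next one in the chain; a failure is reported once, to the team that owns the failing step.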
RISK OF DISASTER RECOVERY PLAN DRIFT
A common problem, which is often exacerbated by the infrequency of testing, is that disaster recovery plans quickly become
out of date. Perhaps part of a service is moved to a new server or additional load balancers are added to make the service more
resilient. In either case, one of two things will happen:
a. If a service does go down and the disaster recovery plan is initiated, at some point during the recovery process, something
isn’t going to work, which will add time to the recovery window and impact the business whilst the error is tracked down.
b. You will get false notifications of disasters when really the business services are functioning just fine. In most cases, you’d
expect confirmation of a disaster before any plan would be initiated, so these kinds of issues should cause limited exposure.
Still, it is an unwanted distraction.
The solution in both of these cases is tight change and configuration controls and, again, automation can play its part in ensuring
these processes are always executed as part of any infrastructure updates. BMC Atrium Orchestrator has specific runbooks
which integrate with server, network, and database configuration tools which automatically generate change tasks for any
updates made, thus keeping an accurate audit record of change and also keeping the CMDB up to date. In the BMC ProactiveNet
Performance Management example, this would in turn maintain the business service models and impacts of technology faults
on supporting configuration items ensuring that the monitoring / alerting mechanism is also current and accurate.
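One way to picture plan drift is as a difference between the configuration items a recovery plan references and those the CMDB currently lists for the same service. The sketch below computes that difference with plain Python sets. It is an illustration under stated assumptions: the dictionaries stand in for plan and CMDB data, `find_plan_drift` is a hypothetical helper, and nothing here uses the actual BMC Atrium CMDB interface.

```python
from typing import Dict, Set

ItemsByService = Dict[str, Set[str]]


def find_plan_drift(plan_items: ItemsByService,
                    cmdb_items: ItemsByService) -> Dict[str, Dict[str, Set[str]]]:
    """Report, per service, where the recovery plan and the CMDB disagree.

    "stale"   = items the plan references that the CMDB no longer lists
    "missing" = items in the CMDB that the plan does not cover
    """
    drift: Dict[str, Dict[str, Set[str]]] = {}
    for service in plan_items.keys() | cmdb_items.keys():
        planned = plan_items.get(service, set())
        actual = cmdb_items.get(service, set())
        stale = planned - actual
        missing = actual - planned
        if stale or missing:
            drift[service] = {"stale": stale, "missing": missing}
    return drift


# Example mirroring the text: the service moved to a new server and
# gained a second load balancer, but the plan was never updated.
plan = {"trading": {"srv-old", "lb-1"}}
cmdb = {"trading": {"srv-new", "lb-1", "lb-2"}}
drift = find_plan_drift(plan, cmdb)
```

An empty result means the plan and the CMDB agree; any other result is a candidate change task to bring the plan back in line.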
RISK OF INFREQUENT TESTING
As in the section above, infrequent testing of a disaster recovery plan results in inaccuracies in the plan. Whereas before
automation a disaster recovery plan could take tens or even hundreds of staff many hours to test, automation could test the plan
in a fraction of the time using far fewer people. The combination of fewer people and much faster testing times means that plans
can be tested on a much more frequent basis, greatly reducing the risk of out-of-date plans.
THE COMPLETE BMC SOLUTION
[Figure: a Trading Service running at Site A and at Site B / Cloud, with Atrium Orchestrator, CMDB / CMS, Service Model, and Business Service User; numbered callouts 1 to 6 as listed below]
1. Service Model generated from CMDB
2. Real-time monitoring of service model through BMC ProactiveNet Performance Manager
3. Service-impacting event causes service outage / failure
4. Service Impact alert picked up by BMC Atrium Orchestrator
5. Atrium Orchestrator initiates associated DR workflow
6. Service is automatically recovered at secondary site and service resumed
Authorizations: Decision makers confirm disaster (Example: Change requests generated in Remedy ITSM – wait on approvals)
Failover Process (e.g.):
» Shut down what's left of production environment
» Re-allocate resources at DR site
» Data synchronization
» Restart service at DR site
» Re-direct users
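The authorization gate and the example failover process in this overview can be sketched as follows. This is a simplified illustration, assuming approval can be reduced to a single yes/no check and each failover step to a single call; the `failover` function and its `approved` and `execute` callbacks are hypothetical and do not represent Remedy ITSM or Atrium Orchestrator interfaces.

```python
from typing import Callable, List

# Example failover steps, taken from the solution overview above.
FAILOVER_STEPS = [
    "shut down what is left of the production environment",
    "re-allocate resources at DR site",
    "synchronize data",
    "restart service at DR site",
    "re-direct users",
]


def failover(approved: Callable[[], bool],
             execute: Callable[[str], None]) -> List[str]:
    """Run the failover steps in order, but only after the disaster is confirmed.

    `approved` stands in for waiting on change-request approvals;
    `execute` stands in for whatever performs each step.
    """
    if not approved():
        return []  # no confirmation, so nothing is touched
    done: List[str] = []
    for step in FAILOVER_STEPS:
        execute(step)
        done.append(step)
    return done
```

Keeping the approval check as the first action means a false alert cannot trigger a shutdown of production on its own.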
CONCLUSIONS
The disaster recovery and business continuity processes in place at most companies typically consist of written procedures
augmented by traditional systems management tools for recovering IT resources. This fragmented approach extends recovery
time and hinders continuous process improvement initiatives. BMC Atrium Orchestrator provides a single point of visibility and
control for executing business recovery in the event of minor ‘daily disasters’ or major events that disrupt business services.
It provides immediate value by allowing you to incrementally build out your recovery processes, automating key recovery
processes first, until you have a fully integrated end-to-end process. And by implementing automation, you get a reliable,
repeatable process that will serve as the foundation for continuous process improvement.
BMC, BMC Software, and the BMC Software logo are the exclusive properties of BMC Software, Inc., are registered with the U.S. Patent and Trademark Office, and may be registered or pending registration in other countries. All other BMC trademarks, service marks, and logos may be registered or pending registration in the U.S. or in other countries. Oracle is a registered trademark of Oracle Corporation. UNIX is the registered trademark of The Open Group in the US and other countries. All other trademarks or registered trademarks are the property of their respective owners. © 2011 BMC Software, Inc. All rights reserved.
Business Runs on IT. IT Runs on BMC Software.
Business thrives when IT runs smarter, faster and stronger. That’s why the most demanding IT organizations in the world rely on BMC
Software across distributed, mainframe, virtual and cloud environments. Recognized as the leader in Business Service Management,
BMC offers a comprehensive approach and unified platform that helps IT organizations cut cost, reduce risk and drive business profit.
For the four fiscal quarters ended December 31, 2010, BMC revenue was approximately $2 billion. Visit www.bmc.com for more information.