1
Disaster RecoveryAwareness
2
Environmental: Fire, flood, earthquake, etc.
Loss of utilities & services: Power interruption, etc.
Systems or equipment failure
Information security breach
Civil protest/unrest
Disruptions in third parties and business partners
Pandemic outbreaks
Acts of terrorism
Threats to Business
3
Revenue Loss Angry CustomersNon-Compliance
with Laws and Regulations
Embarrassment
Impacts Due to Disaster
4
Definitions, Acronyms and Abbreviations
Business Continuity (BC) takes Disaster Recovery one step further, and includes facilities, work force continuity, vital business operations, etc.
Disaster Recovery (DR) is the ability to continue with services in the case of major IT outages, often with reduced capabilities or performance. Disaster Recovery solutions typically involve manual activities.
High Availability (failover) is the characteristic of a system to protect against or recover from minor IT outages in a short time frame with largely automated means.
5
Definitions, Acronyms and Abbreviations (Continued)
Recovery Time Objective (RTO) The maximum elapsed time a business function can sustain a service interruption from the time a crisis is identified to the restoration of service.
Recovery Point Objective (RPO) The amount of data loss that is acceptable. For example, if an RPO of 6 hours is acceptable, SPE will be able to restore systems back to the state they were in 6 hours before the disaster or less. Any data created or modified inside a recovery point objective will be either lost or must be recreated during a recovery.
Tier 0 Basic Infrastructure (Such as DNS, LDAP, Active Directory, network services, etc.), required for business application to run.
Tier 1 Application with Recovery Time Objective of less than 12 hours and Recovery Point Objective of less than 5 minutes.
Tier 2 Application with Recovery Time Objective of more than 12 hours but less than 48 hours, and Recovery Point Objective of less than 24 hours.
Tier 3 Application with Recovery Time Objective of more than 48 hours and less than 15 days, and Recovery Point Objective of less than 36 hours.
Tier 4 Application with Recovery Time Objective of more than 15 days and Recovery Point Objective of less than 36 hours.
6
Disaster Recovery Plan:
• Aims to provide an organized way to respond to a disruptive event.
• Addresses the IT procedures to be followed before, during and after a loss to identified applications at Primary Datacenter (Chandler, AZ)
• The primary objective is to provide the capability to bring up critical applications and data at an alternate site within a time frame that minimizes the loss to the organization.
• Per Business Impact Analysis 2008 – Tier’s determined & DR Environment implemented for the following SPE Business Critical Applications. As of January 2013 – DR Systems located at El Segundo colo. (SAP in process of establishing DR environment)
• eVMI• ITSM/C2C• PRISM• TAAS• GPMS
DR Plan
7
DR Site
DR Site Name: Digital Realty Trust co-location. (El Segundo Datacenter)
Address: 2260 E El Segundo Blvd,
City, State ZIP: El Segundo, CA 90245
Directions: 15 minute drive from Corporate Pointe, Culver City, CA
Contact Name / Phone: Ron Goede
Director – Facilities SPE - EIS
600 Corporate Pointe
310.665 6639 Office
310.678.6461 Cell
E-mail: [email protected]
Web: http://www.digitalrealtytrust.com/
8
DR Process Life Cycle
Project Planning- Completed in 2008
Risk Assessment & Business Impact Analysis
DR Testing &Exercising
Awareness & Training
Maintenance & Updating
DR testing & exercising is performed after implementation phase to ensure DR environment can be brought up within the required time (RTO) using the documented DR procedure.
Completed in 2008
eVMI, PRISM, ITSM/C2C, TAAS,
GPMS
DR Environment ImplementationEstablished in 2008 –updates in progress
DR Procedure Documentation
9
• Applications were classified as Tier1/Tier2/Tier3 based on the results of Risk Assessment
• Determined RTO and RPO to ensure minimum business impact
• DR environments were set up at the DR site (currently being updated to mirror Prod.)
• DR Procedure Documents developed (currently being updated)
• Tests conducted
• Maintain the Document
DR Methodology
Tier RTO RPO
1 Less than 12 hrs Minutes to 6 hrs
2 12-48 hrs 6-24 hrs
3 2 wks Less than 36 hrs
4 4 wks Less than 36 hrs
10
Tiers – based on Business Impact Analysis 2008
Application name Division Tier Financial Impact Per Day
5 Time And Attendance System (TAAS) Corporate Tier 1 $6 M (If it occurs on any Monday)
25 eVMI HE Tier 1 $4,693,334
38 ITSM, C2C, ITSM-S, SARA TV-I Tier 2 $3.2 Million
30 SPIRIT MP Tier 1 $2.7 Million
8 Timecapture Imageworks Corporate Tier 1 > $2 M (If it occurs on any Monday or Friday)
24 PRISM & Pathfinder HE Tier 2 $1 Million
34 Domestic TV Sales Management TV-D Tier 2 $64,286
23 Sales Estimation Tool (SET) HE Tier 2(Tier 1?) $50,000
31 SPIRITworld MP Tier 1 $18,600
28 GPMS Suite MP Tier 3 $18,000
27 InterPlan MP Tier 2(Tier 1?) $16,000
7 Hyperion Enterprise Corporate Tier 1(Tier 2?) $5,000
14 Sony Pictures Stock Footage DMG Tier 1 $5,000
22 Revenue Sharing System (REVSHARE) HE Tier 3 $5,000
11 cineSHARE+ / cineVIEW DMG Tier 2 $3,900
4 HR Connection Corporate Tier 3 $3,000
1 IRIMS – Incident Management Corporate Tier 3 $800
6 Medical Services System - Medgate Corporate Tier 3 $350
3 Remedy IT Corporate Tier 1 $0
9 E-911 LAN Alert (Siemens Phone System) Corporate Tier 1(Tier 3?) $0
17 SPiDR (Stellent) ETS Tier 1(Tier 0?) $0
12 Secure File Distribution Services DMG Tier 2 $0
20 mySPE, mySPEJ, mySPTI ETS Tier 2 $0
13 Digital Media Repository DMG Tier 3 $0
36 DealMaker TV-D Tier 3 $0
37 SPT B2B Marketing TV-D Tier 3 $0
10 ACORN/DMI DMG Tier 1 * Tier 1 Tier 2 Tier 3
29 Spent and Committed MP Tier 1 *
32 Automated System for Ad/Pub fulfillment (ASAP) MP Tier 2 *
39 Harris Landmark Traffic and Ad Sales TV-I Tier 2 *
40 Harris VISION Scheduling TV-I Tier 2 *
2 CorpTax Corporate Tier 3 *
* Awaiting final number
11
Disaster Levels
Recovery Team Leader to mobilize all Recovery Teamsand guide them to
relocate toanother SPE
location outside LA or mobilize
teams from other SPE locationoutside LA
(London, etc.)
Recovery Team Leader to mobilize
the required Recovery Teams and
guide them to relocate to another area within the same premises
Level II
Recovery Team Leader to mobilize
the required Recovery Teams and
guide them to relocate to another
SPE location within LA
Begin Recovery
Procedures
Operations at Recovery Location
Restoration
Can we operate from
any other location within
LA (SPE Lot)?
Asset Level Recovery City Level RecoverySite Level Recovery
Level I Level III
Can we work from the
same premises?
Begin Recovery
Procedures
Begin Recovery
Procedures
12
DR Network Strategy
EmergencyTypes Description Causes Network Recovery Strategy
Level 1 Asset Level Failure
System crashes, disk failures, theft, hacking, virus threats
SPE LAN to be used primarily. However, if administrators are away from the primary site, they could use VPN and make the DR site as primary.
Level 2 Site Level Failure
Major power outages, communication link failures, network component failures, sabotage
Administrators would be required to use Sony LAN from the alternate site or VPN and make the DR site as primary.
Level 3 City Level Failure
Total outage because of natural or manmade disasters
Administrators in the US may not be able to use either LAN or VPN and would need to depend on administrators outside US (Europe/India etc).
13
Crisis Management Team/Facilities
Disaster
Recovery FlowchartDisaster
Notify DR Team &Crisis Management Team
Verification & Assessment
Management Alert
DR Plan Activation & Notification Procedures
Assemble Teams
Perform Recovery at Alternate Site Begin Salvage Operations at Primary Site
Return to Normal Operation
Damage Assessment Team
DR Teams
Salvage TeamIT Recovery Team
Normal Operations Team
End of Crisis
Emergency Plan Activation (Evacuation)
DR Operations Flow
14
Teams & Responsibilities
Team ResponsibilitiesCrisis Management Team • Directs, allocates resources and oversees the situation
Damage Assessment Team • Assesses the impact of disaster event and informs Crisis Management Team• Coordinates the collection of information & evidence required for insurance claims
Salvage Team • Returns primary site to normal operating conditions• Clears and repairs the primary processing facility
Disaster Recovery Team • Implements recovery procedure at DR Site• Recovers the systems & data at DR Site
Normal Operations Team • Returns Production from DR to Primary Site
Other teams required during Disaster Recovery are:
• Emergency Response Team – Evacuation of employees, coordinating with non-SPE emergency teams (Fire Dept., etc.)• Public Relations Team – Handling media, public, etc.• Security Incident Response Team – If there is any security incident (Denial of Service (DoS), Virus Attack, etc.)
Current Status
•DR architecture has not been updated since 2011 to match Production architecture.
•Operational Testing documents for the 5 identified applications that have DR in place (PRISM, eVMI, ITSM/C2C, GPMS and TAAS ) date back to 2011
•In the process of determining the requirements for additional infrastructure:
oTo account for changes in the architecture and Infrastructure for currently established DR applicationsoTo provision Infrastructure for changes in the architecture and infrastructure for existing DR platforms as well as new DR platforms (such as Oracle on AIX, JBoss, etc.)
Current Status
•Business Impact Analysis was completed in 2007 – since then no formal updates have been published. New applications may not have a DR environment/plan
•SAP -Draft of Disaster Recovery Plan being finalized. Infrastructure has been provisioned.
•Per Business Continuity Team (Scot Falkenstein) - we are aligned with policy effective May 12, 2012
•Testing will need to be re-scheduled once the architectural and design changes, Version upgrades and WebLogic to JBoss conversions are completed. (TAAS, ITSM/C2C, GPMS, PRISM)
1. Review current
documents on File
2. Perform
Initial Analysis
Architecture updates? Personnel changes?Call tree?
3. Meet with
Application & Platform
Leads
4.Coordinate Application review and
updates
5. Confirm
DR Servers including platform
interfaces
7.Update DR
Env. & Testing
Documents
8. Schedule
Test
9. Conduct
Test
10.Document
results, learnings
and resolve issues
11.Prepare for Full
Test
6.Upgrade platforms
Process Flow for 2013 DR Parallel Testing
ITSM, PRISM, TAAS, WebMethods, WDG, Datastage, IDM, D-Series, Alfresco,
Oracle, JBoss, Weblogic, Apache, BO
SAP eVMI
18
Network/ Storage Team Server Team Database Team ETL Team Application Team
Disaster Declared@ Chandler
Activate the Salvage team torecover/salvage
from the Primary site
Send notificationsTo DR Teams
Check & stop applications or services
running at Chandler(if any)
Stop replication
Recover & import volumes
Mount volumes &check file systems
All volumes
mounted & file systemsconsistent?
No
Mount and start the Oracle and DB2
databases
Start middleware(webMethods, Datastage
& D-Series)
Change Pegasus interface to point
to DR site
Crisis ManagementTeam
Yes
A
E
Disaster Recovery Flow Chart – During Disaster (eVMI Recovery at DR Site)
Ensure connectivity to DR Site
Network team
Storage team
SAMPLE
IT Recovery Coordinator
19
Network/Storage Team Server Team Database Team ETL Team Application TeamCrisis Management
Team
Start eVMI Application & Services
Test eVMI services &Test eVMI application
Is eVMI
applicationrunning?
Inform users to Switch over to DR URL
Inform Crisis Management TeamDeclare DR site up
Monitor the situation
IsPrimary site
up?
B
No
Yes
A
E
Disaster Recovery Flow Chart – During Disaster (Recovery at DR Site) Continued...
Yes
No
SAMPLE
IT Recovery Coordinator
20
IT Recovery Coordinator
Network/Storage Team Server Team Database Team ETL Team Application Team
Start reverse replicationof data from DR site
Stop replication,recover & import primary volumes
Mount Primary volumes &check file systems
All volumes
mounted & file systemsconsistent?
No
Mount and start the Oracle and DB2
databases
Start middleware(webMethods, Datastage
& D-Series)
Change Pegasus interface to point
to Primary site
Crisis ManagementTeam
Yes
B
Start Siteminder & eVMI Application & Services
Test eVMI services &Test eVMI application
Is eVMI
applicationrunning?
C
No
Yes
Disaster Recovery Flow Chart – After Disaster (Restoration to Normal Operations)
Network team
Storage team
SAMPLE
Ensure connectivity to primary site
21
Network/Storage Team Server Team Database Team ETL Team Application TeamCrisis Management
Team
Inform users to Switch-over to
Chandler Prod. URL
Inform Crisis Management Team
Declare Primarysite up
Start replication to DR
C
Check & stop applications or services
running at DR(if any)
STOP
Disaster Recovery Flow Chart – After Disaster (Restoration to Normal Operations) Continued…
Network team
Storage team
SAMPLE
IT Recovery Coordinator
22
Important Contact Details
Notification Tree Sample - eVMI.
SAMPLE
23
Type of Tests
• Parallel– The DR site is activated– Only test transactions are carried
out at the DR site– Main business continues at the
primary site without any interruptions
• Full– The primary (Prod) site is shut
down– DR site is activated as the primary– Business is conducted at the DR
site
Testing Strategy and Prioritization
•DR testing for current DR Applications has been prioritized by determining when their changes, upgrades and conversions will be complete.
oIf changes are due to complete in the next 3 months (May 2013) ,we have scheduled DR testing to occur after changes are complete. (PRISM & ITSM/C2C)
oIf changes are greater than 3 months away we will test as soon as architecture has been updated, then re-test after changes are made.
o eVMI has fewer dependencies therefore we will conduct some testing in April 2013, a Parallel test in June 2013 & re-test in the 4th Quarter when all changes are complete.
oThe GPMS team has specifically requested that we delay all DR testing till October 2013
oSAP failover to El Segundo has been tested successfully (March 2013) . DR Parallel testing will be conducted once all changes/updates are complete
Tentative Schedule – DR Parallel Testing
eVMIAPRIL 2013
TAAS MAY 2013
PRISM MAY/JUNE 2013
ITSM/C2CMAY/JUNE 2013
GPMS 4th QUARTER PER REQUEST
SAPMARCH 2013 – (FAILOVER TESTED)
26
You cannot control disaster.But you can be prepared for it.