Download pdf - DR Awareness Presentation for App Teamsv3.pptO-Media...•SAP -Draft of Disaster Recovery Plan being finalized. Infrastructure has been provisioned. •Per Business Continuity Team

1

Disaster RecoveryAwareness

2

Environmental: Fire, flood, earthquake, etc.

Loss of utilities & services: Power interruption, etc.

Systems or equipment failure

Information security breach

Civil protest/unrest

Disruptions in third parties and business partners

Pandemic outbreaks

Acts of terrorism

Threats to Business

3

Revenue Loss Angry CustomersNon-Compliance

with Laws and Regulations

Embarrassment

Impacts Due to Disaster

4

Definitions, Acronyms and Abbreviations

Business Continuity (BC) takes Disaster Recovery one step further, and includes facilities, work force continuity, vital business operations, etc.

Disaster Recovery (DR) is the ability to continue with services in the case of major IT outages, often with reduced capabilities or performance. Disaster Recovery solutions typically involve manual activities.

High Availability (failover) is the characteristic of a system to protect against or recover from minor IT outages in a short time frame with largely automated means.

5

Definitions, Acronyms and Abbreviations (Continued)

Recovery Time Objective (RTO) The maximum elapsed time a business function can sustain a service interruption from the time a crisis is identified to the restoration of service.

Recovery Point Objective (RPO) The amount of data loss that is acceptable. For example, if an RPO of 6 hours is acceptable, SPE will be able to restore systems back to the state they were in 6 hours before the disaster or less. Any data created or modified inside a recovery point objective will be either lost or must be recreated during a recovery.

Tier 0 Basic Infrastructure (Such as DNS, LDAP, Active Directory, network services, etc.), required for business application to run.

Tier 1 Application with Recovery Time Objective of less than 12 hours and Recovery Point Objective of less than 5 minutes.

Tier 2 Application with Recovery Time Objective of more than 12 hours but less than 48 hours, and Recovery Point Objective of less than 24 hours.

Tier 3 Application with Recovery Time Objective of more than 48 hours and less than 15 days, and Recovery Point Objective of less than 36 hours.

Tier 4 Application with Recovery Time Objective of more than 15 days and Recovery Point Objective of less than 36 hours.

6

Disaster Recovery Plan:

• Aims to provide an organized way to respond to a disruptive event.

• Addresses the IT procedures to be followed before, during and after a loss to identified applications at Primary Datacenter (Chandler, AZ)

• The primary objective is to provide the capability to bring up critical applications and data at an alternate site within a time frame that minimizes the loss to the organization.

• Per Business Impact Analysis 2008 – Tier’s determined & DR Environment implemented for the following SPE Business Critical Applications. As of January 2013 – DR Systems located at El Segundo colo. (SAP in process of establishing DR environment)

• eVMI• ITSM/C2C• PRISM• TAAS• GPMS

DR Plan

7

DR Site

DR Site Name: Digital Realty Trust co-location. (El Segundo Datacenter)

Address: 2260 E El Segundo Blvd,

City, State ZIP: El Segundo, CA 90245

Directions: 15 minute drive from Corporate Pointe, Culver City, CA

Contact Name / Phone: Ron Goede

Director – Facilities SPE - EIS

600 Corporate Pointe

310.665 6639 Office

310.678.6461 Cell

E-mail: [email protected]

Web: http://www.digitalrealtytrust.com/

8

DR Process Life Cycle

Project Planning- Completed in 2008

Risk Assessment & Business Impact Analysis

DR Testing &Exercising

Awareness & Training

Maintenance & Updating

DR testing & exercising is performed after implementation phase to ensure DR environment can be brought up within the required time (RTO) using the documented DR procedure.

Completed in 2008

eVMI, PRISM, ITSM/C2C, TAAS,

GPMS

DR Environment ImplementationEstablished in 2008 –updates in progress

DR Procedure Documentation

9

• Applications were classified as Tier1/Tier2/Tier3 based on the results of Risk Assessment

• Determined RTO and RPO to ensure minimum business impact

• DR environments were set up at the DR site (currently being updated to mirror Prod.)

• DR Procedure Documents developed (currently being updated)

• Tests conducted

• Maintain the Document

DR Methodology

Tier RTO RPO

1 Less than 12 hrs Minutes to 6 hrs

2 12-48 hrs 6-24 hrs

3 2 wks Less than 36 hrs

4 4 wks Less than 36 hrs

10

Tiers – based on Business Impact Analysis 2008

Application name Division Tier Financial Impact Per Day

5 Time And Attendance System (TAAS) Corporate Tier 1 $6 M (If it occurs on any Monday)

25 eVMI HE Tier 1 $4,693,334

38 ITSM, C2C, ITSM-S, SARA TV-I Tier 2 $3.2 Million

30 SPIRIT MP Tier 1 $2.7 Million

8 Timecapture Imageworks Corporate Tier 1 > $2 M (If it occurs on any Monday or Friday)

24 PRISM & Pathfinder HE Tier 2 $1 Million

34 Domestic TV Sales Management TV-D Tier 2 $64,286

23 Sales Estimation Tool (SET) HE Tier 2(Tier 1?) $50,000

31 SPIRITworld MP Tier 1 $18,600

28 GPMS Suite MP Tier 3 $18,000

27 InterPlan MP Tier 2(Tier 1?) $16,000

7 Hyperion Enterprise Corporate Tier 1(Tier 2?) $5,000

14 Sony Pictures Stock Footage DMG Tier 1 $5,000

22 Revenue Sharing System (REVSHARE) HE Tier 3 $5,000

11 cineSHARE+ / cineVIEW DMG Tier 2 $3,900

4 HR Connection Corporate Tier 3 $3,000

1 IRIMS – Incident Management Corporate Tier 3 $800

6 Medical Services System - Medgate Corporate Tier 3 $350

3 Remedy IT Corporate Tier 1 $0

9 E-911 LAN Alert (Siemens Phone System) Corporate Tier 1(Tier 3?) $0

17 SPiDR (Stellent) ETS Tier 1(Tier 0?) $0

12 Secure File Distribution Services DMG Tier 2 $0

20 mySPE, mySPEJ, mySPTI ETS Tier 2 $0

13 Digital Media Repository DMG Tier 3 $0

36 DealMaker TV-D Tier 3 $0

37 SPT B2B Marketing TV-D Tier 3 $0

10 ACORN/DMI DMG Tier 1 * Tier 1 Tier 2 Tier 3

29 Spent and Committed MP Tier 1 *

32 Automated System for Ad/Pub fulfillment (ASAP) MP Tier 2 *

39 Harris Landmark Traffic and Ad Sales TV-I Tier 2 *

40 Harris VISION Scheduling TV-I Tier 2 *

2 CorpTax Corporate Tier 3 *

* Awaiting final number

11

Disaster Levels

Recovery Team Leader to mobilize all Recovery Teamsand guide them to

relocate toanother SPE

location outside LA or mobilize

teams from other SPE locationoutside LA

(London, etc.)

Recovery Team Leader to mobilize

the required Recovery Teams and

guide them to relocate to another area within the same premises

Level II

Recovery Team Leader to mobilize

the required Recovery Teams and

guide them to relocate to another

SPE location within LA

Begin Recovery

Procedures

Operations at Recovery Location

Restoration

Can we operate from

any other location within

LA (SPE Lot)?

Asset Level Recovery City Level RecoverySite Level Recovery

Level I Level III

Can we work from the

same premises?

Begin Recovery

Procedures

Begin Recovery

Procedures

12

DR Network Strategy

EmergencyTypes Description Causes Network Recovery Strategy

Level 1 Asset Level Failure

System crashes, disk failures, theft, hacking, virus threats

SPE LAN to be used primarily. However, if administrators are away from the primary site, they could use VPN and make the DR site as primary.

Level 2 Site Level Failure

Major power outages, communication link failures, network component failures, sabotage

Administrators would be required to use Sony LAN from the alternate site or VPN and make the DR site as primary.

Level 3 City Level Failure

Total outage because of natural or manmade disasters

Administrators in the US may not be able to use either LAN or VPN and would need to depend on administrators outside US (Europe/India etc).

13

Crisis Management Team/Facilities

Disaster

Recovery FlowchartDisaster

Notify DR Team &Crisis Management Team

Verification & Assessment

Management Alert

DR Plan Activation & Notification Procedures

Assemble Teams

Perform Recovery at Alternate Site Begin Salvage Operations at Primary Site

Return to Normal Operation

Damage Assessment Team

DR Teams

Salvage TeamIT Recovery Team

Normal Operations Team

End of Crisis

Emergency Plan Activation (Evacuation)

DR Operations Flow

14

Teams & Responsibilities

Team ResponsibilitiesCrisis Management Team • Directs, allocates resources and oversees the situation

Damage Assessment Team • Assesses the impact of disaster event and informs Crisis Management Team• Coordinates the collection of information & evidence required for insurance claims

Salvage Team • Returns primary site to normal operating conditions• Clears and repairs the primary processing facility

Disaster Recovery Team • Implements recovery procedure at DR Site• Recovers the systems & data at DR Site

Normal Operations Team • Returns Production from DR to Primary Site

Other teams required during Disaster Recovery are:

• Emergency Response Team – Evacuation of employees, coordinating with non-SPE emergency teams (Fire Dept., etc.)• Public Relations Team – Handling media, public, etc.• Security Incident Response Team – If there is any security incident (Denial of Service (DoS), Virus Attack, etc.)

Current Status

•DR architecture has not been updated since 2011 to match Production architecture.

•Operational Testing documents for the 5 identified applications that have DR in place (PRISM, eVMI, ITSM/C2C, GPMS and TAAS ) date back to 2011

•In the process of determining the requirements for additional infrastructure:

oTo account for changes in the architecture and Infrastructure for currently established DR applicationsoTo provision Infrastructure for changes in the architecture and infrastructure for existing DR platforms as well as new DR platforms (such as Oracle on AIX, JBoss, etc.)

Current Status

•Business Impact Analysis was completed in 2007 – since then no formal updates have been published. New applications may not have a DR environment/plan

•SAP -Draft of Disaster Recovery Plan being finalized. Infrastructure has been provisioned.

•Per Business Continuity Team (Scot Falkenstein) - we are aligned with policy effective May 12, 2012

•Testing will need to be re-scheduled once the architectural and design changes, Version upgrades and WebLogic to JBoss conversions are completed. (TAAS, ITSM/C2C, GPMS, PRISM)

1. Review current

documents on File

2. Perform

Initial Analysis

Architecture updates? Personnel changes?Call tree?

3. Meet with

Application & Platform

Leads

4.Coordinate Application review and

updates

5. Confirm

DR Servers including platform

interfaces

7.Update DR

Env. & Testing

Documents

8. Schedule

Test

9. Conduct

Test

10.Document

results, learnings

and resolve issues

11.Prepare for Full

Test

6.Upgrade platforms

Process Flow for 2013 DR Parallel Testing

ITSM, PRISM, TAAS, WebMethods, WDG, Datastage, IDM, D-Series, Alfresco,

Oracle, JBoss, Weblogic, Apache, BO

SAP eVMI

18

Network/ Storage Team Server Team Database Team ETL Team Application Team

Disaster Declared@ Chandler

Activate the Salvage team torecover/salvage

from the Primary site

Send notificationsTo DR Teams

Check & stop applications or services

running at Chandler(if any)

Stop replication

Recover & import volumes

Mount volumes &check file systems

All volumes

mounted & file systemsconsistent?

No

Mount and start the Oracle and DB2

databases

Start middleware(webMethods, Datastage

& D-Series)

Change Pegasus interface to point

to DR site

Crisis ManagementTeam

Yes

A

E

Disaster Recovery Flow Chart – During Disaster (eVMI Recovery at DR Site)

Ensure connectivity to DR Site

Network team

Storage team

SAMPLE

IT Recovery Coordinator

19

Network/Storage Team Server Team Database Team ETL Team Application TeamCrisis Management

Team

Start eVMI Application & Services

Test eVMI services &Test eVMI application

Is eVMI

applicationrunning?

Inform users to Switch over to DR URL

Inform Crisis Management TeamDeclare DR site up

Monitor the situation

IsPrimary site

up?

B

No

Yes

A

E

Disaster Recovery Flow Chart – During Disaster (Recovery at DR Site) Continued...

Yes

No

SAMPLE


20


Network/Storage Team Server Team Database Team ETL Team Application Team

Start reverse replicationof data from DR site

Stop replication,recover & import primary volumes

Mount Primary volumes &check file systems

All volumes

mounted & file systemsconsistent?

No

Mount and start the Oracle and DB2

databases

Start middleware(webMethods, Datastage

& D-Series)

Change Pegasus interface to point

to Primary site

Crisis ManagementTeam

Yes

B

Start Siteminder & eVMI Application & Services

Test eVMI services &Test eVMI application

Is eVMI

applicationrunning?

C

No

Yes

Disaster Recovery Flow Chart – After Disaster (Restoration to Normal Operations)

Network team

Storage team

SAMPLE

Ensure connectivity to primary site

21

Network/Storage Team Server Team Database Team ETL Team Application TeamCrisis Management

Team

Inform users to Switch-over to

Chandler Prod. URL

Inform Crisis Management Team

Declare Primarysite up

Start replication to DR

C

Check & stop applications or services

running at DR(if any)

STOP

Disaster Recovery Flow Chart – After Disaster (Restoration to Normal Operations) Continued…

Network team

Storage team

SAMPLE


22

Important Contact Details

Notification Tree Sample - eVMI.

SAMPLE

23

Type of Tests

• Parallel– The DR site is activated– Only test transactions are carried

out at the DR site– Main business continues at the

primary site without any interruptions

• Full– The primary (Prod) site is shut

down– DR site is activated as the primary– Business is conducted at the DR

site

Testing Strategy and Prioritization

•DR testing for current DR Applications has been prioritized by determining when their changes, upgrades and conversions will be complete.

oIf changes are due to complete in the next 3 months (May 2013) ,we have scheduled DR testing to occur after changes are complete. (PRISM & ITSM/C2C)

oIf changes are greater than 3 months away we will test as soon as architecture has been updated, then re-test after changes are made.

o eVMI has fewer dependencies therefore we will conduct some testing in April 2013, a Parallel test in June 2013 & re-test in the 4th Quarter when all changes are complete.

oThe GPMS team has specifically requested that we delay all DR testing till October 2013

oSAP failover to El Segundo has been tested successfully (March 2013) . DR Parallel testing will be conducted once all changes/updates are complete

Tentative Schedule – DR Parallel Testing

eVMIAPRIL 2013

TAAS MAY 2013

PRISM MAY/JUNE 2013

ITSM/C2CMAY/JUNE 2013

GPMS 4th QUARTER PER REQUEST

SAPMARCH 2013 – (FAILOVER TESTED)

26

You cannot control disaster.But you can be prepared for it.