University of Southern California Center for Systems and Software Engineering Recovering IT in a Disaster & Classic Mistakes CS 577b Software Engineering II Supannika Koolmanojwong


Page 1: Recovering IT in a Disaster &  Classic Mistakes

University of Southern California

Center for Systems and Software Engineering

Recovering IT in a Disaster & Classic Mistakes

CS 577b Software Engineering II

Supannika Koolmanojwong

Page 2: Recovering IT in a Disaster &  Classic Mistakes

http://en.wikipedia.org/wiki/Hurricane_Katrina
http://napoleonlive.info/see-the-evidence/never-forget-9-11-essay/
http://news.nationalgeographic.com

Page 3: Recovering IT in a Disaster &  Classic Mistakes

Avian influenza

http://www.itrportal.com/absolutenm/templates/article-channelnews.aspx?articleid=7115&zoneid=45
http://bepast.org/dataman.pl?c=lib&frame_nav=1&dir=docs/photos/avian%20flu/

Cyber attack

Page 4: Recovering IT in a Disaster &  Classic Mistakes

California Natural Disasters

http://www.americanforests.org/magazine/article/regrowing-a-forest/
http://www.exponent.com/earthquake_engineering/

Page 5: Recovering IT in a Disaster &  Classic Mistakes

Recovering IT in a Disaster: Lessons from Hurricane Katrina

Iris Junglas, Blake Ives, MIS Quarterly Executive Vol. 6 No. 1 / Mar 2007

• August 29, 2005 - Hurricane Katrina destroyed a data center and communications infrastructure at the Pascagoula and Gulfport, Mississippi, operations of the Ship Systems sector of Northrop Grumman Corporation

• Also put a second data center out of commission in a shipyard near New Orleans

http://www.scholastic.com/browse/article.jsp?id=3754772

Page 6: Recovering IT in a Disaster &  Classic Mistakes

NGC’s Shipyard

• 20,000 employees in ship construction
• Katrina caused over US$1 billion in damage to the company
• Brought two of the nation’s largest shipyards to a standstill

Page 7: Recovering IT in a Disaster &  Classic Mistakes

Recovering IT in a Disaster

• How to adapt when the business continuity plan falls short and the public infrastructure is inadequate

• Reexamine our processes for preparing disaster plans

• Processes for assessing preparedness and response after a disaster or a near-disaster.

Page 8: Recovering IT in a Disaster &  Classic Mistakes

Northrop Grumman Corporation

• Products: electronics, aerospace, and shipbuilding
• Customers: government and commercial customers worldwide
• Major business:
– Ship construction: large military vessels
– Revenue: US$5.7 billion in 2005
– Customers: DoD and Navy
– 12,900 employees in Mississippi
– 7,100 employees in New Orleans

Page 9: Recovering IT in a Disaster &  Classic Mistakes

Preparation for Hurricane

• Hurricanes are nothing new to the shipbuilding industry
– September 2004: Hurricane Ivan
– July 2005: Hurricane Dennis
• A bigger one was heading in: August 2005
• 11 people dead and over US$1 billion in damage in Florida

http://www.fema.gov/hazard/flood/recoverydata/katrina/katrina_about.shtm

Page 10: Recovering IT in a Disaster &  Classic Mistakes

Preparation for Hurricane

• Data
– Data backups were sent to Iron Mountain (information management services)
– Second backup in Dallas
• Servers
– Powered off
– Wrapped in plastic
• New backup generator in a secure location
• Only one extranet kept alive (crucial to the Navy and DoD)
• People
– Left the area

Page 11: Recovering IT in a Disaster &  Classic Mistakes

The storm smashed

• NGC facilities were in the storm’s path
• Communication failed
• Extensive damage to the shipyards and nearby communities
• Emergency command center set up at the Dallas office
– A newly assembled emergency team was formed
– Began to pull together the first stages of NGC’s disaster recovery response

Page 12: Recovering IT in a Disaster &  Classic Mistakes

Damages

• Collected digital images of the damage
• At Mississippi, lost
– 1,500 PCs, 200 servers, 300 printers, 600 data input devices, and hundreds of two-way radios
– Communications closets, routers, switches, fiber and copper cables and wires
– LAN / WAN / MAN no longer worked
• At New Orleans
– Infrastructure was still in place
– AC systems were not working, so servers shut down automatically
• A week after the storm, communication lines went down again because cars were driving over them

Page 13: Recovering IT in a Disaster &  Classic Mistakes

First things first

• Not about restoring computer systems, but restoring human resources
• But most of the 20,000 employees were out of contact
• Tools
– Press releases
– Corporate web site (67,000 hits in the weeks after the storm)
– Toll-free call-in number
• Payroll through Wal-Mart and Western Union

Page 14: Recovering IT in a Disaster &  Classic Mistakes

Restoring IT infrastructure

• Electronic communication: nonexistent because the public communication infrastructure was down
• BlackBerry communication worked only intermittently
• Two-way radios, walkie-talkies
• Key members used satellite phones
– Required line-of-sight access to satellites
• Later on, wireless communication was used

Page 15: Recovering IT in a Disaster &  Classic Mistakes

Building new data center

• Hardware acquisition
– 1,500 desktops, 200 servers, etc.
– Contacted suppliers and reordered the latest orders
• Incompatibilities between software and the new hardware environment
• System documentation was inaccessible or difficult to find, e.g. license keys, server names, addressing schemes, login IDs

Page 16: Recovering IT in a Disaster &  Classic Mistakes

Restoring data and applications

• Some firms found that their backup data was partially unreadable
• NGC had two backups: Iron Mountain and Dallas
• Some data on desktops or local machines was lost
• Two weeks after Katrina, NGC had a new data center and essential systems were up and running
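
The first bullet above points to a check that can be automated: confirm that a backup copy is still readable and unchanged before it is needed. The following is a minimal sketch, not NGC's procedure. The manifest format (a path relative to the backup root mapped to a SHA-256 digest recorded at backup time) and the names `sha256_of` and `find_unreadable` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unreadable(backup_root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are missing, unreadable, or altered.

    `manifest` maps a path relative to `backup_root` to the digest
    recorded when the backup was written (hypothetical format).
    """
    bad = []
    for relative_path, expected in manifest.items():
        try:
            actual = sha256_of(backup_root / relative_path)
        except OSError:
            bad.append(relative_path)  # missing file or unreadable media
            continue
        if actual != expected:
            bad.append(relative_path)  # readable, but contents changed
    return bad
```

Running such a check on a schedule, against every backup location, would surface a partially unreadable copy while a second copy can still be used to repair it.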

Page 17: Recovering IT in a Disaster &  Classic Mistakes

Disaster preparedness

• Common mistake: organizations prepare only for disasters specific to their domain
– Financial institutions prepare for IT failures
– Hospitals for pandemics
– Airlines for technical failures and sabotage
• An alternative approach: consider a broader spectrum of disaster types, such as the generic disaster categories
– Economic, information, physical, human resource, reputation, psychopathic, and natural disasters
• Identify common characteristics of each disaster category, then construct the plan

Page 18: Recovering IT in a Disaster &  Classic Mistakes

IT disaster preparedness framework

• Provide generic objectives and measurements, guidelines for establishing IT disaster preparedness
• Emphasize developing an IT continuity plan, identifying and allocating critical resources, executing a business impact analysis, and maintaining, testing and training of the plan
• COBIT (Control Objectives for Information and Related Technology)
– For operational IT and business managers
– Focus on three core elements of IT governance: IT as an asset, IT-related risks, and IT control structures
• ITIL (IT Infrastructure Library)
– Focus is to improve the efficiency and effectiveness of IT services delivered to customers within the enterprise
– De facto standard for IT service management

Page 19: Recovering IT in a Disaster &  Classic Mistakes

Lessons Learned

1. Keep Data and Data Centers Out of Harm’s Way
2. Don’t Assume the Public Infrastructure Will Be Available
3. Plan for Civil Unrest
4. Assume Some People Will Not Be Available
5. Leverage Your Suppliers as Critical Team Members

Page 20: Recovering IT in a Disaster &  Classic Mistakes

Lessons Learned

6. Expect the Unexpected
7. Get Prepared: Crisis portfolio
8. Establish a Strong Leadership Position
9. Empower Decision Makers on the Team
10. Exploit Fresh-Start Opportunities

Page 21: Recovering IT in a Disaster &  Classic Mistakes

IT disaster recovery plan

Page 22: Recovering IT in a Disaster &  Classic Mistakes

IT disaster recovery (DR) plan

National Institute of Standards and Technology (NIST)

• Goal: minimize any negative impacts to company operations
• By
– Identifying critical IT systems and networks
– Prioritizing their recovery time objectives
– Delineating the steps needed to restart, reconfigure, and recover them
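
The planning steps above (identify critical systems, then prioritize them by recovery time objective) can be sketched as a small ordering routine. This is an illustration only: the `System` record, its fields, and the sample systems and hour values are hypothetical and do not come from the NIST guidance or from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    critical: bool      # identified as critical during planning
    rto_hours: float    # recovery time objective, in hours


def recovery_order(systems: list[System]) -> list[System]:
    """Return critical systems first, each group sorted by shortest RTO."""
    return sorted(systems, key=lambda s: (not s.critical, s.rto_hours))


# Hypothetical inventory, for illustration only.
inventory = [
    System("intranet wiki", critical=False, rto_hours=72),
    System("payroll", critical=True, rto_hours=24),
    System("customer extranet", critical=True, rto_hours=4),
]

for system in recovery_order(inventory):
    print(f"{system.name}: restore within {system.rto_hours} h")
```

Sorting on the pair (not critical, RTO) keeps the two decisions separate: criticality decides the group, and the RTO decides the order within it.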

http://searchdisasterrecovery.techtarget.com/feature/IT-disaster-recovery-DR-plan-template-A-free-download-and-guide

Page 23: Recovering IT in a Disaster &  Classic Mistakes

IT Disaster Recovery Process

Perform Risk Assessment

Identify potential threats 

Determine important infrastructure elements

Page 24: Recovering IT in a Disaster &  Classic Mistakes

Structure for an IT disaster recovery plan (1)

1. Develop the contingency planning policy statement. A formal policy provides the authority and guidance necessary to develop an effective contingency plan.

2. Conduct the business impact analysis (BIA). The business impact analysis helps to identify and prioritize critical IT systems and components.

3. Identify preventive controls. These are measures that reduce the effects of system disruptions and can increase system availability and reduce contingency life cycle costs.

4. Develop recovery strategies. Thorough recovery strategies ensure that the system can be recovered quickly and effectively following a disruption.

National Institute of Standards and Technology (NIST)

Page 25: Recovering IT in a Disaster &  Classic Mistakes

Structure for an IT disaster recovery plan (2)

5. Develop an IT contingency plan. The contingency plan should contain detailed guidance and procedures for restoring a damaged system.

6. Plan testing, training and exercising. Testing the plan identifies planning gaps, whereas training prepares recovery personnel for plan activation; both activities improve plan effectiveness and overall agency preparedness.

7. Plan maintenance. The plan should be a living document that is updated regularly to remain current with system enhancements.

National Institute of Standards and Technology (NIST)

Page 26: Recovering IT in a Disaster &  Classic Mistakes

Important IT disaster recovery planning considerations

• Senior management support.
• Take the IT DR planning process seriously. You need the right information, and that information should be current and accurate.
• Availability of standards. Standards relevant to IT DR plans include NIST SP 800-34, ISO/IEC 24762, and BS 25777.
• Keep it simple.
• Review results with business units.
• Be flexible.

Page 27: Recovering IT in a Disaster &  Classic Mistakes

 Reviewing the IT disaster recovery plan template (1)

• Information Technology Statement of Intent -- This sets the stage and direction for the plan.

• Policy Statement -- Very important to include an approved statement of policy regarding the provision of disaster recovery services.

• Objectives -- Main goals of the plan.
• Key Personnel Contact Information -- Very important to have key contact data near the front of the plan. It's the information most likely to be used right away, and should be easy to locate.

Page 28: Recovering IT in a Disaster &  Classic Mistakes

 Reviewing the IT disaster recovery plan template (2)

• Plan Overview -- Covers aspects of the plan itself, such as updating.
• Emergency Response -- Describes what needs to be done immediately following the onset of an incident.
• Disaster Recovery Team -- Members and contact information of the DR team.
• Emergency Alert, Escalation and DRP Activation -- Steps to take through the early phase of the incident, leading to activation of the DR plan.
• Media, Insurance, Financial and Legal Issues

Page 29: Recovering IT in a Disaster &  Classic Mistakes

• Single Disk Failure
– Likelihood and Impact: Medium
– Detection (how will we know it has happened): Nagios warning
– Immediate Action: Replace failed disk in RAID volume.
– Later Action: Order new disks. Have existing disks destroyed.
– Effect on Users: No effect
– Mitigation and Contingency (currently in place): Nagios monitoring of RAID volumes. Keep replacement drives available.

• Multiple Disk Failure
– Likelihood and Impact: Low
– Detection (how will we know it has happened): Nagios warning
– Immediate Action: Replace failed disks in RAID volume. Restore from hot backup.
– Later Action: Order new disks. Have existing disks destroyed.
– Effect on Users: No effect (failover)
– Mitigation and Contingency (currently in place): Nagios monitoring of RAID volumes. Keep replacement drives available.

• Unauthorised modification of content
– Likelihood and Impact: Low
– Detection (how will we know it has happened): Periodic auditing of logs. Monitoring of application.
– Immediate Action: Restore modified content.
– Later Action: Repair security breach. Determine root vulnerability.
– Effect on Users: Low effect on users.
– Mitigation and Contingency (currently in place): Determine root vulnerability. Repair vulnerability.

www.questionpro.com/.../SA-Disaster-Recovery-Plan-120D.doc
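
The disk-failure entries above rely on a Nagios warning for detection. As a rough sketch of what such a check could look like, the function below follows the Nagios plugin convention of returning a state code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) with a one-line message. Everything else is an assumption: the source does not say which RAID technology is in use, so the sketch assumes Linux software RAID and the member-status field of `/proc/mdstat` (for example `[UU]` healthy, `[U_]` degraded), and the one-failure-warns policy is invented for illustration.

```python
import re

# Nagios plugin state codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Matches the member-status field in /proc/mdstat, e.g. "[UU]" or "[U_]".
STATUS_FIELD = re.compile(r"\[([U_]+)\]")


def check_mdstat(mdstat_text: str) -> tuple[int, str]:
    """Map Linux software-RAID status text to a Nagios state and message."""
    fields = STATUS_FIELD.findall(mdstat_text)
    if not fields:
        return UNKNOWN, "RAID UNKNOWN - no arrays found"
    failed = sum(field.count("_") for field in fields)
    if failed == 0:
        return OK, f"RAID OK - {len(fields)} array(s) healthy"
    # Assumed policy: one failed member warns, more than one is critical.
    state = WARNING if failed == 1 else CRITICAL
    label = "WARNING" if state == WARNING else "CRITICAL"
    return state, f"RAID {label} - {failed} failed member(s)"


def main() -> int:
    """Read /proc/mdstat, print the status line, and return the state code."""
    try:
        with open("/proc/mdstat", encoding="ascii") as handle:
            state, message = check_mdstat(handle.read())
    except OSError as error:
        state, message = UNKNOWN, f"RAID UNKNOWN - {error}"
    print(message)
    return state
```

A wrapper script would pass the value returned by `main()` to `sys.exit` so that Nagios sees it as the plugin's exit status.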

Page 30: Recovering IT in a Disaster &  Classic Mistakes

• Data loss
– Likelihood and Impact: Low
– Detection (how will we know it has happened): Nagios warning
– Immediate Action: Restore data from hot or offsite backup.
– Later Action: No later action necessary.
– Effect on Users: Users will not have access to their data.
– Mitigation and Contingency (currently in place): Hot and offsite backups in place.

• Multiple machine failure
– Likelihood and Impact: Low
– Detection (how will we know it has happened): Nagios warning
– Immediate Action: Repair machine, replace machine with hot backup machine.
– Later Action: Repair machine, replace machine with hot backup machine. Order new hot backup machine.
– Effect on Users: Low effect (failover). Performance will be compromised.
– Mitigation and Contingency (currently in place): Monitor machine health with Nagios.

• Software failure
– Likelihood and Impact: Medium
– Detection (how will we know it has happened): Nagios warning
– Immediate Action: Update/repair software.
– Later Action: Update/repair software.
– Effect on Users: Low effect or no access to software.
– Mitigation and Contingency (currently in place): Update software to latest stable version.

www.questionpro.com/.../SA-Disaster-Recovery-Plan-120D.doc
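
The risk entries above amount to a small risk register. One way to keep such a register reviewable is to hold it as structured data and sort it by rating. The sketch below is an illustration under assumptions: the `Risk` class and `rank` helper are hypothetical, only some of the columns are modelled, and the "High" rating is included for completeness even though no entry above uses it. The three sample rows are copied from the entries above.

```python
from dataclasses import dataclass

# Review order for the single "Likelihood and Impact" rating.
# "High" is an assumed level; the source entries use only Medium and Low.
RATING_ORDER = {"High": 0, "Medium": 1, "Low": 2}


@dataclass(frozen=True)
class Risk:
    description: str
    rating: str             # "Likelihood and Impact" column
    detection: str
    immediate_action: str
    effect_on_users: str


def rank(risks: list[Risk]) -> list[Risk]:
    """Sort risks so the highest-rated ones are reviewed first."""
    return sorted(risks, key=lambda r: RATING_ORDER[r.rating])


register = [
    Risk("Data loss", "Low", "Nagios warning",
         "Restore data from hot or offsite backup.",
         "Users will not have access to their data."),
    Risk("Single disk failure", "Medium", "Nagios warning",
         "Replace failed disk in RAID volume.", "No effect"),
    Risk("Software failure", "Medium", "Nagios warning",
         "Update/repair software.", "Low effect or no access to software."),
]

for risk in rank(register):
    print(f"[{risk.rating}] {risk.description}: {risk.immediate_action}")
```

Because Python's sort is stable, entries with the same rating keep the order in which they were recorded.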

Page 31: Recovering IT in a Disaster &  Classic Mistakes

Classic Mistakes

IT Project Management: Infamous Failures, Classic Mistakes, and Best Practices

R. Ryan Nelson, MIS Quarterly Executive, 2007

Page 32: Recovering IT in a Disaster &  Classic Mistakes

Classic Mistakes

• People
• Process
• Product
• Technology

Page 33: Recovering IT in a Disaster &  Classic Mistakes

Classic Mistakes : People

• Undermined motivation
• Individual capabilities of the team members
• Failure to take action to deal with a problem employee
• Adding people to a late project
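
The last bullet is commonly explained with Brooks's observation that coordination overhead grows faster than headcount: a team of n people has n(n-1)/2 possible pairwise communication paths. The sketch below only illustrates that arithmetic. It is not taken from the slides or from Nelson's study, and the function name is hypothetical.

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct person-to-person pairs in a team."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


for size in (5, 8, 10):
    print(size, "people ->", communication_paths(size), "paths")
```

Doubling a team from 5 to 10 people raises the number of pairs from 10 to 45, which is one reason late additions can slow a project rather than speed it up.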

Page 34: Recovering IT in a Disaster &  Classic Mistakes

Classic Mistakes : Process

• Wasted time on the fuzzy front end (approval and budgeting), leading to an aggressive schedule later
• Human tendency to underestimate and produce overly optimistic schedules
• Insufficient risk management
– Lack of sponsorship, changes in stakeholder commitment, scope creep, and contractor failure
• Risks from outsourcing and offshoring
– QA, interfaces, unstable requirements

Page 35: Recovering IT in a Disaster &  Classic Mistakes

Classic Mistakes : Product

• Requirements gold-plating
– Unnecessary product size and/or characteristics
• Developer gold-plating
– Developers try out new technology / features
• Feature creep
– +/- 25% change in requirements over lifetime

Page 36: Recovering IT in a Disaster &  Classic Mistakes

Classic Mistakes : Technology

• Silver-bullet syndrome
– Expect new technology to solve all problems
– 4GL, CASE tools, OOD
• Overestimated savings from new tools or methods
– Did not account for learning curve and unknown unknowns
• Switching tools in the middle of a project
– Version upgrade

Page 37: Recovering IT in a Disaster &  Classic Mistakes

Findings from empirical study (99 projects)

• Finding 1
– People (43%), Process (45%), Product (8%), Technology (4%)
• Scope creep
– Not in the top 10, although ¼ of the projects faced scope creep, so managers should watch out for it
• Top 3 mistakes were found in ½ of the projects
– Should have focused more on estimation, scheduling, stakeholder management, and risk management

Page 38: Recovering IT in a Disaster &  Classic Mistakes


Page 39: Recovering IT in a Disaster &  Classic Mistakes

Classic Mistakes vs Best Practices

Page 40: Recovering IT in a Disaster &  Classic Mistakes

References

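• R. Ryan Nelson, IT Project Management: Infamous Failures, Classic Mistakes, and Best Practices, MIS Quarterly Executive, 2007
• Iris Junglas and Blake Ives, Recovering IT in a Disaster: Lessons from Hurricane Katrina, MIS Quarterly Executive, Vol. 6 No. 1, March 2007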
