Maintaining Business Continuity After Internal and External Incidents Greg Schaffer, CISSP Director of Network Services Middle Tennessee State University

Copyright Greg Schaffer 2008. This work is the intellectual property of the author. Permission is granted for this material to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced materials and notice is given that the copying is by permission of the author. To disseminate otherwise or to republish requires written permission from the author.

Our Story Begins Like Many…

It was late in the afternoon one weekday when suddenly alarms sounded in the NOC. It was clear SOMETHING had happened, because connectivity was shattered across campus. Students could not access online classes, purchase orders could not be processed, email would not go through…

BUSINESS DISCONTINUITY

Troubleshooting the Problem

It was relatively easy to pinpoint what wasn’t talking to what.

The fact that many things were not talking to many other things indicated that more than one “thing” was affected.

A check of devices indicated the problem was not in the equipment but at the physical layer.

It was clear that this was going to take SOME TIME to fix!

Location, Location, Location

The relative location of the physical layer issue was determined to be at or on the site of new stadium construction.

However, there were no initial indications of anything wrong.

When asked, the construction workers said they had not been digging…

BUT

…neglected to mention they had been pile driving rocks to prepare a trench for a new water line.

The concrete encased conduits were damaged by the equipment.

The area was excavated to reveal what we hoped was minimal damage…

Minimal Damage?!

Getting Services Up

While the extent of the physical damage wasn’t clear until complete excavation was done the next morning, it was clear that there was enough physical damage to assume that the conduits would not be usable for replacement fiber optics.

There were redundant fiber cables between data centers that took different routes across campus…

Forming the Plan

…except for one portion, which happened to be the pulverized area!

A plan was needed to restore communications…fast

The plan:

– access manholes on either end of the damage and splice new fibers in manholes

– run fibers temporarily on the road, and close the road to all traffic (planned anyway)
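The redundant routes shared one portion, and that portion was where the damage occurred. A minimal sketch of how such a shared segment could be flagged ahead of time, assuming each route is documented as a list of conduit segments between named points. The names below (DC1, DC2, MH1 through MH5) and both routes are hypothetical, not the actual campus layout.

```python
def normalize(segment):
    """Treat a conduit segment as undirected: (A, B) is the same as (B, A)."""
    a, b = segment
    return (a, b) if a <= b else (b, a)


def shared_segments(route_a, route_b):
    """Return the segments that appear in both routes, in route_a order."""
    in_b = {normalize(s) for s in route_b}
    return [normalize(s) for s in route_a if normalize(s) in in_b]


# Hypothetical routes between two data centers (illustration only).
primary = [("DC1", "MH1"), ("MH1", "MH2"), ("MH2", "MH3"), ("MH3", "DC2")]
backup = [("DC1", "MH4"), ("MH4", "MH3"), ("MH3", "MH2"),
          ("MH2", "MH5"), ("MH5", "DC2")]

# Any segment listed here is a single point of failure for both routes.
print(shared_segments(primary, backup))  # [('MH2', 'MH3')]
```

An empty result would mean the two documented routes are physically diverse, at least as far as the documentation is accurate.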

Finding Manhole Difficult

Eventually Circuits Back Up

But Almost Down Again!

Graduation was that Saturday

Road opened for visitors

Temporary fibers had vehicles driving over them most of the day!

Fibers held, but needless to say they would not be reused…

Post Mortem

Eventually (nearly one month later) a manhole was constructed around the break, and new fibers pulled through the repaired area and spliced

Despite “normal” controls (“Tennessee One Call”, conduits encased in concrete, redundant fibers, etc.) “Bad Stuff” happened

Bad Stuff = Good Lessons

Operations Security Controls

Preventative

Detective

Corrective

Directive

Recovery

Deterrent

Compensating

CISSP CBK

Preventive/Detective

Failed:

– Tennessee One Call (dirt covered markings)

– Hardened Physical Paths

Worked (but after the fact)

– Network monitoring

– Help desk reporting

– Documentation

And Keep Manhole Uncovered!

Corrective/Directive

Worked

– Emergency Web Communications

– Temporary fiber construction (temporary corrective control for Business/Mission Continuity)

– Shovel

Failed

– Blocking car and truck traffic

Recovery

More of a longer term approach to prevent the same occurrence

Redundant fiber between data centers

Must also consider separate building entrances

Cost of solution vs cost of downtime analysis
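The cost of solution vs cost of downtime comparison can be sketched as simple arithmetic. This is an illustration under stated assumptions, not the analysis actually performed: only the 10-hour outage length comes from this incident, while the cost per hour, outage frequency, and solution cost are hypothetical placeholders to be replaced with local figures.

```python
def expected_annual_downtime_cost(hours_per_outage, cost_per_hour,
                                  years_between_outages):
    """Average yearly cost of outages of this kind."""
    return hours_per_outage * cost_per_hour / years_between_outages


def payback_years(solution_cost, annual_cost_avoided):
    """Years until the solution pays for itself; infinite if it avoids nothing."""
    if annual_cost_avoided <= 0:
        return float("inf")
    return solution_cost / annual_cost_avoided


OUTAGE_HOURS = 10             # length of the downtime in this incident
COST_PER_HOUR = 5_000.0       # hypothetical cost of campus-wide downtime
YEARS_BETWEEN_OUTAGES = 5     # hypothetical: one such incident every 5 years
SOLUTION_COST = 60_000.0      # hypothetical cost of a fully diverse fiber path

annual = expected_annual_downtime_cost(
    OUTAGE_HOURS, COST_PER_HOUR, YEARS_BETWEEN_OUTAGES)
print(annual)                                # 10000.0
print(payback_years(SOLUTION_COST, annual))  # 6.0
```

With these placeholder numbers the diverse path would pay for itself in six years. Whether that is acceptable is a judgment call that depends on the real figures and on costs that are harder to quantify.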

Deterrent/Compensating

Worked:

– Penalty/Insurance

– Temporary fiber run

– Cutting of ducts

– Creation of new manhole

Finally

It ended up being a late night, hampered by many events. Our DR/BC plan did not specifically address this problem...NOR SHOULD IT HAVE. A good DR/BC plan is flexible and adaptive. The necessary resources were mobilized quickly based on existing DR/BC plans. What could have been a very large disaster goes down as a downtime that lasted 10 hours.