Introduction
CSC/ECE 772: Survivable Networks
Spring, 2009, Rudra Dutta
Copyright Rudra Dutta, NCSU, Spring, 2009 2
Motivation
Failures can affect our lives fairly directly
– Internet evolution: lab curiosity, mil/gov, educational, commercial, business, social, lifeline/ubiquitous
Modern society critically depends on communication networks
– Similar to power grid, transportation system, water distribution
Mission critical business functions must be available 24/7
– Web-based transaction systems
– 1-800 services
– e-commerce
Government services, emergency (911) services
Scientific projects (e-Science, etc.)
Everyday communication services
– BT will switch entirely to IP by 2012 (Spectrum, 1/2007)
Survivability must be the foremost consideration in network design, not an afterthought
Failure Events
Link failures - fiber cuts
Failure of active components inside network equipment
– Transmitters, receivers, controllers
– Individual channel failures (in a WDM system)
Node failures - due to catastrophic event
– Rare events, but cause widespread disruption
Software failures - due to immense complexity
– Usually dealt with by using proper software design techniques
– Hard to protect against in the network
Failure Causes
Human error - most common cause
– “Backhoe effect”, operator errors, ...
Natural events - floods, snow storms, earthquakes
– 1997 solar storm caused Telstar failure
– 1988 fire at Hinsdale central office (CO)
– 1999 damage from Hurricane Floyd
– 2002 fire melted Verizon fiber cable
Animals!
– Rodents gnaw on cable jackets
– Sharks bite undersea cables (TAT-8)
Sabotage - terrorism (9/11)
Wear and tear
Chances of Failure
Fiber optic cables are critical components: we know to
– ...physically protect cables,
– ...bury them suitably deep,
– ...be careful when digging,
– So why do fiber cables get cut at all?
Similar issues in operating many large-scale systems:
– Nuclear reactors, water systems, air traffic control / airplanes
– Lay person: baffled when things go wrong
– Insider: knows how many things can go wrong
Statistical certainty of fairly high rate of failures!
Average life of fiber span - 228 years
– But consider laid fiber-miles
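The point about laid fiber-miles can be made with a little arithmetic. The following is a minimal sketch, assuming every span fails independently at a constant rate of one failure per 228 years; the span counts are hypothetical and chosen only for illustration.

```python
# Rough arithmetic: a long per-span life still means frequent failures network-wide.
# Assumption: spans fail independently at a constant rate (one failure per 228 years each).
MEAN_SPAN_LIFE_YEARS = 228.0  # figure quoted on the slide


def expected_failures_per_year(num_spans: int) -> float:
    """Expected number of span failures per year across the whole network."""
    return num_spans / MEAN_SPAN_LIFE_YEARS


def mean_days_between_failures(num_spans: int) -> float:
    """Average time between span failures somewhere in the network, in days."""
    return 365.0 / expected_failures_per_year(num_spans)


# Hypothetical network sizes, for illustration only.
for spans in (1, 228, 1000, 10000):
    rate = expected_failures_per_year(spans)
    gap = mean_days_between_failures(spans)
    print(f"{spans:>6} spans: {rate:8.2f} failures/year, one every {gap:10.1f} days")
```

Under these assumptions a network of 228 spans already expects about one cut per year, and the rate grows linearly with the number of spans laid.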
Service Outage Statistics
Engineering Fault Tolerance
Failures may be rare or common, but are inevitable
Should not be baffling!
– (At least not to the designer of the system!)
– Should, in fact, be predictable (at least statistically)
Must engineer for failure - common to many disciplines
Most repair times are much larger than acceptable restoration times
– Restoration of service must be engineered to operate with active failure in network
Outage Duration
Revenue loss
– Loss of business (e.g., voice-calling revenue)
– Default on SLAs
Business disruption
– Regular business impacted
– Societal impact/risks (travel, education, financial services, 911)
– Lawsuits, bankruptcies
Network dynamics
– Application/TCP session timeouts, router connectivity loss
– Overloading
Outage Effects
Target Range | Duration | Main Effects
Protection Switching | < 50 ms | Service “hit”; TCP sees no impact
1 | 50 - 200 ms | Few voiceband disconnects; ATM cell rerouting may start
2 | 200 - 2000 ms | Some switched connections drop; TCP protocol backoff
3 | 2 - 10 s | Switched circuit services drop (X.25); TCP session timeouts; “webpage not available” errors; affects router hello protocol
4 | 10 s - 5 min | Calls and data sessions terminated; TCP/IP applications timeout; users attempt mass redials; routers issue LSAs; topology update, network-wide resynch
“Undesirable” | 5 - 30 min | Routers under heavy reattempts load; minor business/societal impact; noticeable “Internet brownout”
“Unacceptable” | > 30 min | Regulatory reporting required; major societal impacts/risks, headlines; SLA clauses triggered, lawsuits
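The duration ranges listed above can be turned into a simple lookup. This is a minimal sketch, not part of the original slides: the function and variable names are illustrative, and it assumes that a duration falling exactly on a boundary belongs to the more severe range, which the slides do not specify.

```python
# Look up the outage-effect target range for a given outage duration.
# Thresholds come from the slide's table; names and boundary handling are assumptions.
import bisect

# Upper bound of each bounded range, in seconds.
_UPPER_BOUNDS_S = [0.05, 0.2, 2.0, 10.0, 300.0, 1800.0]

# One label per range, ordered from shortest to longest outage.
_LABELS = [
    "Protection Switching",  # < 50 ms
    "1",                     # 50 - 200 ms
    "2",                     # 200 - 2000 ms
    "3",                     # 2 - 10 s
    "4",                     # 10 s - 5 min
    "Undesirable",           # 5 - 30 min
    "Unacceptable",          # > 30 min
]


def outage_category(duration_s: float) -> str:
    """Return the target-range label for an outage lasting `duration_s` seconds."""
    if duration_s < 0:
        raise ValueError("outage duration cannot be negative")
    # bisect_right assigns a boundary value to the next (more severe) range.
    return _LABELS[bisect.bisect_right(_UPPER_BOUNDS_S, duration_s)]


for seconds in (0.01, 0.1, 1.0, 5.0, 60.0, 600.0, 3600.0):
    print(f"{seconds:>8} s -> {outage_category(seconds)}")
```

A restoration mechanism's worst-case recovery time can be passed through such a lookup to see which class of effects it would expose users to.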
Planning Graded Fault Tolerance
Instantaneous recovery from most significant/frequent failures
– Eliminate human involvement - device level
Fast recovery from other significant or frequent failures
– Also automatic - device or system
Reasonably fast recovery from next tier of failures
– Automated, but may be system / software
Least likely tier - repair and recovery plans
– Includes manual repair, must also plan for liability
Mechanisms for Fault Tolerance
Carefully design-in specific amounts of spare capacity
– spare links/channels/equipment
– bumping low priority traffic
Design network topology for physical diversity
– bi-connected topology (or better)
– diverse routing
– shared risk link group (SRLG) concept
Embed real-time mechanisms to develop/implement “patch plan”
– appropriate protocols and algorithms
– cross-layer interactions
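To make the bi-connected topology requirement concrete, the sketch below checks by brute force whether a topology stays connected after any single link failure and after any single node failure. It is an illustration only: the function names and the example topologies are hypothetical, and a production tool would more likely use a linear-time bridge and articulation-point algorithm.

```python
# Brute-force check that a topology survives any single link or single node failure.
# Function names and example topologies are hypothetical, for illustration only.
from collections import defaultdict


def is_connected(nodes, links):
    """True if all `nodes` are mutually reachable using only links between them."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = defaultdict(set)
    for u, v in links:
        if u in nodes and v in nodes:  # ignore links touching removed nodes
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:  # depth-first search from an arbitrary node
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def survives_any_link_failure(nodes, links):
    """True if the network is connected and stays so after removing any one link."""
    links = list(links)
    return is_connected(nodes, links) and all(
        is_connected(nodes, links[:i] + links[i + 1:]) for i in range(len(links))
    )


def survives_any_node_failure(nodes, links):
    """True if the network is connected and stays so after removing any one node."""
    nodes = set(nodes)
    return is_connected(nodes, links) and all(
        is_connected(nodes - {n}, links) for n in nodes
    )


ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
chain = [("A", "B"), ("B", "C"), ("C", "D")]
for name, links in (("ring", ring), ("chain", chain)):
    print(name, survives_any_link_failure("ABCD", links),
          survives_any_node_failure("ABCD", links))
```

A ring passes both checks while a chain fails both. Note that this tests logical links only; links sharing a duct or conduit (one SRLG) can fail together, so physical diversity needs a separate check.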
Stages of Failure Plan Operation
Failure detection - almost certainly starting from device layer
Failure localization - cooperation between device and software
Failure recovery - software
Failure repair - human
Summary
Faults are real, must plan to address
Faults are diverse, plan must be diverse
Fault tolerance is a system concept, not add-on
– Must plan at various levels
– At each level, must be appropriate response
Must address together with network design problem, hard to achieve after the fact