Introduction

CSC/ECE 772: Survivable Networks

Spring, 2009, Rudra Dutta

Copyright Rudra Dutta, NCSU, Spring, 2009

Motivation

Failures can affect our lives fairly directly
  – Internet evolution: lab curiosity, mil/gov, educational, commercial, business, social, lifeline/ubiquitous
Modern society critically depends on communication networks
  – Similar to power grid, transportation system, water distribution
Mission critical business functions must be available 24/7
  – Web-based transaction systems
  – 1-800 services
  – e-commerce
Government services, emergency (911) services
Scientific projects (e-Science, etc.)
Everyday communication services
  – BT will switch entirely to IP by 2012 (Spectrum, 1/2007)
Survivability must be the foremost consideration in network design, not an afterthought

Failure Events

Link failures - fiber cuts
Failure of active components inside network equipment
  – Transmitters, receivers, controllers
  – Individual channel failures (in a WDM system)
Node failures - due to catastrophic event
  – Rare events, but cause widespread disruption
Software failures - due to immense complexity
  – Usually dealt with by using proper software design techniques
  – Hard to protect against in the network

Failure Causes

Human error - most common cause
  – “Backhoe effect”, operator errors, ...
Natural events - floods, snow storms, earthquakes
  – 1997 solar storm caused Telstar failure
  – 1988 fire at Hinsdale central office (CO)
  – 1999 damage from Hurricane Floyd
  – 2002 fire melted Verizon fiber cable
Animals!
  – Rodents gnaw on cable jackets
  – Sharks bite undersea cables (TAT-8)
Sabotage - terrorism (9/11)
Wear and tear

Chances of Failure

Fiber optic cables are critical components: we know to
  – ...physically protect cables,
  – ...bury them suitably deep,
  – ...be careful when digging,
  – So why do fiber cables get cut at all?
Similar issues in operating many large-scale systems:
  – Nuclear reactors, water systems, air traffic control / airplanes
  – Lay person: baffled when things go wrong
  – Insider: knows how many things can go wrong
Statistical certainty of fairly high rate of failures!
Average life of fiber span - 228 years
  – But consider laid fiber-miles
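
A rough back-of-the-envelope sketch shows why a 228-year average span life still implies frequent cuts once many spans are in service. It assumes that spans fail independently at a constant rate of 1/228 per year. That assumption and the span counts are illustrative and do not come from the slides.

```python
# Rough sketch: expected fiber cuts per year across a network of many spans.
# Assumption (not from the slides): spans fail independently at a constant
# rate, so a mean life of 228 years means 1/228 failures per span per year.

MEAN_SPAN_LIFE_YEARS = 228.0  # figure quoted on the slide


def expected_cuts_per_year(num_spans: int) -> float:
    """Expected number of span failures per year in a network of num_spans spans."""
    return num_spans / MEAN_SPAN_LIFE_YEARS


def mean_days_between_cuts(num_spans: int) -> float:
    """Average number of days between successive cuts somewhere in the network."""
    return 365.0 / expected_cuts_per_year(num_spans)


if __name__ == "__main__":
    # Hypothetical network sizes, chosen only for illustration.
    for spans in (1, 100, 1000, 10000):
        print(f"{spans:>6} spans: {expected_cuts_per_year(spans):8.2f} cuts/year, "
              f"one cut every {mean_days_between_cuts(spans):10.1f} days")
```

Under these assumptions, a hypothetical network of 1000 spans would expect about 4.4 cuts per year, or roughly one every 83 days, even though each individual span is very reliable.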

Service Outage Statistics

Engineering Fault Tolerance

Failures may be rare or common, but are inevitable
Should not be baffling!
  – (At least not to the designer of the system!)
  – Should, in fact, be predictable (at least statistically)
Must engineer for failure - common to many disciplines
Most repair times are much larger than acceptable restoration times
  – Restoration of service must be engineered to operate with active failure in network

Outage Duration

Revenue loss
  – Loss of business (e.g., voice-calling revenue)
  – Default on SLAs
Business disruption
  – Regular business impacted
  – Societal impact/risks (travel, education, financial services, 911)
  – Lawsuits, bankruptcies
Network dynamics
  – Application/TCP session timeouts, router connectivity loss
  – Overloading

Outage Effects

Target Range          Duration        Main Effects
Protection Switching  < 50 ms         Service “hit”
                                      TCP sees no impact
1                     50 - 200 ms     Few voiceband disconnects
                                      ATM cell rerouting may start
2                     200 - 2000 ms   Some switched connections drop
                                      TCP protocol backoff
3                     2 - 10 s        Switched circuit services drop (X.25)
                                      TCP session timeouts
                                      “webpage not available” errors
                                      Affects router hello protocol

Outage Effects (continued)

Target Range          Duration        Main Effects
4                     10 s - 5 min    Calls and data sessions terminated
                                      TCP/IP applications timeout
                                      Users attempt mass redials
                                      Routers issue LSAs
                                      Topology update, network-wide resynch
“Undesirable”         5 - 30 min      Routers under heavy reattempts load
                                      Minor business/societal impact
                                      Noticeable “Internet brownout”
“Unacceptable”        > 30 min        Regulatory reporting required
                                      Major societal impacts/risks, headlines
                                      SLA clauses triggered, lawsuits
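
The target ranges in the Outage Effects tables can be read as a simple lookup from outage duration to severity class. The following is a minimal sketch of that lookup. The function and constant names are hypothetical, and treating each upper bound as exclusive is an assumption, since the slides give only the ranges.

```python
# Sketch: map an outage duration to its target range from the Outage Effects
# tables. Names are hypothetical. Assumption: each upper bound is exclusive,
# so an outage of exactly 50 ms falls in range "1", not "Protection Switching".

# (exclusive upper bound in seconds, target range label), in increasing order.
OUTAGE_RANGES = [
    (0.050, "Protection Switching"),
    (0.200, "1"),
    (2.0, "2"),
    (10.0, "3"),
    (5 * 60.0, "4"),
    (30 * 60.0, "Undesirable"),
]


def classify_outage(duration_s: float) -> str:
    """Return the target range label for an outage lasting duration_s seconds."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    for upper_bound_s, label in OUTAGE_RANGES:
        if duration_s < upper_bound_s:
            return label
    # Anything at or beyond 30 minutes lands in the last class.
    return "Unacceptable"


if __name__ == "__main__":
    for seconds in (0.03, 0.1, 1.0, 5.0, 60.0, 600.0, 3600.0):
        print(f"{seconds:>8} s -> {classify_outage(seconds)}")
```

A restoration mechanism can then be judged by the class its worst-case recovery time falls into, rather than by the raw number of seconds.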

Planning Graded Fault Tolerance

Instantaneous recovery from most significant/frequent failures
  – Eliminate human involvement - device level
Fast recovery from other significant or frequent failures
  – Also automatic - device or system
Reasonably fast recovery from next tier of failures
  – Automated, but may be system / software
Least likely tier - repair and recovery plans
  – Includes manual repair, must also plan for liability

Mechanisms for Fault Tolerance

Carefully design-in specific amounts of spare capacity
  – spare links/channels/equipment
  – bumping low priority traffic
Design network topology for physical diversity
  – bi-connected topology (or better)
  – diverse routing
  – shared risk link group (SRLG) concept
Embed real-time mechanisms to develop/implement “patch plan”
  – appropriate protocols and algorithms
  – cross-layer interactions
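
Two of the physical-diversity ideas above lend themselves to a small illustration: checking that a topology is bi-connected, and checking that two routes are SRLG-diverse. The sketch below reads "bi-connected" as "no single node failure disconnects the network" and uses a brute-force test rather than an efficient articulation-point algorithm. All function names, topologies, link ids, and SRLG assignments are hypothetical.

```python
# Sketch: two checks related to physical diversity.
# Assumptions (not from the slides): "bi-connected" means the network stays
# connected after any single node is removed; graphs with fewer than three
# nodes are treated as not bi-connected; every link is also its own risk.

from collections import deque


def is_connected(adj, removed=None):
    """BFS connectivity test on an undirected graph, ignoring node `removed`."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v != removed and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)


def is_biconnected(adj):
    """Brute force: connected, and still connected without any one node."""
    if len(adj) < 3:
        return False
    return is_connected(adj) and all(is_connected(adj, removed=n) for n in adj)


def risk_groups(path, srlg_of):
    """All risks a path (list of link ids) is exposed to."""
    groups = set()
    for link in path:
        groups.add(("link", link))  # the link itself can fail
        groups.update(("srlg", g) for g in srlg_of.get(link, ()))
    return groups


def srlg_diverse(path_a, path_b, srlg_of):
    """True if the two paths share no link and no shared risk link group."""
    return risk_groups(path_a, srlg_of).isdisjoint(risk_groups(path_b, srlg_of))


if __name__ == "__main__":
    ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
    chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    print("ring bi-connected:", is_biconnected(ring))
    print("chain bi-connected:", is_biconnected(chain))

    # Links A-B and A-D are distinct but share a duct leaving node A.
    srlg_of = {"A-B": {"duct1"}, "A-D": {"duct1"}}
    print("diverse:", srlg_diverse(["A-B", "B-C"], ["A-D", "D-C"], srlg_of))
```

The last check illustrates the point of the SRLG concept: two routes can be link-disjoint in the logical topology and still fail together if they share a physical risk such as a duct.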

Stages of Failure Plan Operation

Failure detection: almost certainly starting from device layer
Failure localization: cooperation between device and software
Failure recovery: software
Failure repair: human

Summary

Faults are real, must plan to address
Faults are diverse, plan must be diverse
Fault tolerance is a system concept, not add-on
  – Must plan at various levels
  – At each level, must be appropriate response
Must address together with network design problem, hard to achieve after the fact