Core Router Testing for High Availability

Scott Poretsky
Avici Systems, Inc.
June 3, 2002

Avici Company Confidential
Reliable Routing for the Internet


Architecture for the 21st Century Network

Outline

IP Network Availability
Test Coverage for 99.999% Availability
Commercial Test Equipment Requirements

IP Network Availability

High Reliability = More Revenue

Reliability is the single biggest criterion in selecting an ISP, according to Interactive Week/Telechoice

[Figure: ISP Customer Survey. Chart of Relative Importance (axis from 4.0 to 4.8) for Reliability, Value, Performance, Customer Service, and Provisioning Speed.]

New IP services demand higher levels of network reliability

High Reliability = More Profit

Compensating for poor router reliability with redundancy and extra interconnects can increase network cost by up to 50%

[Figure: IP Backbone. Diagram labels: Access Devices (VOIP, DSLAM, L3/4 Switch, CMTS, GGSN), Edge Layer, Direct Connects, Aggregation Layer (Hub Router), Core Layer (Backbone Router), Peering with Service Provider Peers.]

Definitions

Reliable: Capable of being dependable (Webster)

Availability: Measure of Reliability using router/switch Uptime

Mission Reliability: Mean Time Between Critical Failures (MTBCF), or the average time between hardware or software failures that interrupt service (the mission)

Maintenance Reliability: Mean Time Between Failures (MTBF), or the average time between hardware failures that require corrective maintenance actions

Defects Per Million (DPM): Measure of downtime equal to (1 – Availability) x 10^6
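As a worked example of the DPM definition, the sketch below converts an availability figure into DPM and into the downtime per year it implies. The function names and the 365.25-day year are illustrative assumptions and do not come from the presentation.

```python
# Illustrative sketch: function names and the year length are assumptions.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed average year


def dpm(availability: float) -> float:
    """Defects Per Million = (1 - Availability) x 10^6."""
    return (1.0 - availability) * 1e6


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, a in [("99.9%", 0.999), ("99.999%", 0.99999), ("99.9999%", 0.999999)]:
        print(f"{label}: DPM = {dpm(a):.0f}, "
              f"downtime = {downtime_minutes_per_year(a):.2f} min/year")
```

Under these assumptions, 99.999% availability corresponds to 10 DPM, or a little over five minutes of downtime per year, while 99.9% corresponds to 1000 DPM, or nearly nine hours per year.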

Contributing Factors for Availability

Mission Reliability
Total Time to Restore Router/Switch After a Software Failure (timeline, not to scale):
Software Failure Occurs, Crash Dump Time, Boot Time, Protocol Convergence Time, Full Operation Restored

Maintenance Reliability
Total Time to Restore a Module After a Hardware Failure (timeline, not to scale):
Hardware Failure Occurs, Maintainer Response Time, Removal and Replacement Time, Image Upgrade Time, Boot Time, Protocol Convergence Time, Full Operation Restored
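The link between these restore-time components and availability can be sketched with the standard steady-state relation, availability = uptime / (uptime + downtime). The slides do not state this formula explicitly. All function names and numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Illustrative sketch: all names and numbers are hypothetical.

def restore_time_hours(crash_dump_s: float, boot_s: float, convergence_s: float) -> float:
    """Total time to restore after a software failure, in hours."""
    return (crash_dump_s + boot_s + convergence_s) / 3600.0


def availability(mtbcf_hours: float, restore_hours: float) -> float:
    """Steady-state availability = uptime / (uptime + downtime)."""
    return mtbcf_hours / (mtbcf_hours + restore_hours)


def required_mtbcf_hours(target_availability: float, restore_hours: float) -> float:
    """MTBCF needed to reach a target availability for a given restore time."""
    return target_availability * restore_hours / (1.0 - target_availability)


if __name__ == "__main__":
    # Assumed example: 60 s crash dump, 180 s boot, 120 s protocol convergence.
    restore = restore_time_hours(crash_dump_s=60, boot_s=180, convergence_s=120)
    print(f"restore time: {restore:.2f} h")
    print(f"availability at MTBCF 9999.9 h: {availability(9999.9, restore):.5f}")
    print(f"MTBCF needed for 99.999%: {required_mtbcf_hours(0.99999, restore):.1f} h")
```

The sketch shows the two levers: shortening any component of the restore time, or lengthening the time between critical failures, raises availability.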

The Availability Goal

The Goal – 99.999% Router Availability
The Reality – 99.9% Router Availability
Features to achieve 99.999% availability:
  Non-Stop Routing
  Graceful Restart

What if testing could improve Mission Reliability to achieve 99.999% Availability in the absence of new features?

What if the addition of these new features would then achieve 99.9999% Availability?

Test Coverage

Traditional Test Coverage

Isolated testing of protocols
  Functionality
  Conformance
  Interoperability
  Scaling
Forwarding Performance in the absence of protocols
Disadvantages
  Operational environment is not tested
  Operational conditions are not tested
  The router under test is not completely stressed

Deployed routers run multiple protocols simultaneously.

Test Program for 99.999% Availability

Stress Testing
Longevity Testing
Convergence Testing
Network-Specific Topology Testing
Automated Regression Testing

Stress Testing

Simultaneous configuration and scaling of multiple protocols
  BGP, IGP
  MPLS-TE, LDP (optional)
  MBGP, PIM-SM, MSDP (optional)
Traffic Forwarding
  Line Rate Traffic Forwarding
  Overutilize links
  Enable QoS
Network Instability
  Repeated Route Flaps
  Link Loss
  Tunnel Reroutes (optional)
Serviceability
  Repeated SNMP Gets
  Logging Enabled
  Debug Enabled
  Telnet with SHOW commands (stressful and invalid)

Stress Configuration

[Figure: Stress test setup. Diagram labels: Router Under Test; Neighbor Router (two); Optional Neighbor Router for Tunnel Reroutes; Test Equipment (three).]

Stress Execution Guidelines

Configure ECMP, Parallel Paths, and Composite Links between routers
Use Live BGP Feed for Route Table
Mix traffic types across links (IP Unicast, IP Multicast, MPLS)
One neighbor router should be a different vendor to show interoperability under stress
Run Stress for many days (if the router lasts that long)

Router should experience more in a couple of days than it likely would in its operational lifetime.

Typical Stress Metrics

Flap 1 million BGP routes per hour
Forward 10 Terabits of data per hour
Perform 100,000 SNMP Gets per hour
Simulate 100 fiber cuts per hour (use every remote interface)
Along with
  Full BGP Table
  Full IGP Table
  Full Multicast Cache
  Required MPLS-TE Tunnels (protection optional)
  Required LDP FECs
Enable Logging and Protocol Debug
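To get a feel for what these hourly targets mean for a test script, the sketch below converts them into average per-second rates and event spacing. The hourly figures come from the slide. The dictionary keys and function names are illustrative assumptions.

```python
# Illustrative sketch: keys and function names are assumptions.
SECONDS_PER_HOUR = 3600

# Hourly targets taken from the slide.
HOURLY_TARGETS = {
    "BGP route flaps": 1_000_000,
    "bits forwarded": 10e12,  # 10 Terabits
    "SNMP Gets": 100_000,
    "fiber cuts": 100,
}


def per_second(per_hour: float) -> float:
    """Average rate per second needed to reach an hourly target."""
    return per_hour / SECONDS_PER_HOUR


def seconds_between_events(per_hour: float) -> float:
    """Average spacing between events for an hourly target."""
    return SECONDS_PER_HOUR / per_hour


if __name__ == "__main__":
    for name, target in HOURLY_TARGETS.items():
        print(f"{name}: {per_second(target):,.1f} per second on average")
    print(f"one fiber cut every {seconds_between_events(100):.0f} s")
```

On average this works out to roughly 278 route flaps per second, about 2.8 Gb/s of sustained forwarding, about 28 SNMP Gets per second, and one simulated fiber cut every 36 seconds.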

Longevity Testing

Similar to Stress Testing, but with more operational (less stressful) conditions injected over many weeks.
  Simultaneous configuration and scaling of multiple protocols
  Traffic Forwarding
  More realistic Network Instability
  More typical Serviceability actions

Use Live Internet feed.

Convergence Terms

Network Convergence - The point in time at which all nodes in a network have updated their routing tables for a route entry change (new, withdrawal, or modification).

Protocol Convergence - The point in time at which a single node updates its routing table and advertises the route table change to its peer in a routing protocol advertisement (or update) message.

Route Convergence - The point in time at which a single node updates its routing table and reroutes traffic out the new interface.

Route Convergence is the common Router Benchmark.
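One way test equipment can estimate Route Convergence time, not described in the slides, is to send traffic at a constant rate toward the affected routes and count the packets lost while the router reroutes. The sketch below assumes a constant offered rate and assumes all loss is caused by the convergence event. The function name is hypothetical.

```python
# Illustrative sketch of a loss-derived convergence estimate.
# Assumes constant-rate traffic and that all loss is due to the reroute.

def route_convergence_time_s(packets_lost: int, offered_rate_pps: float) -> float:
    """Estimate convergence time as lost packets divided by offered rate."""
    if offered_rate_pps <= 0:
        raise ValueError("offered_rate_pps must be positive")
    return packets_lost / offered_rate_pps


if __name__ == "__main__":
    # Hypothetical measurement: 50,000 packets lost at 1,000,000 packets/s.
    t = route_convergence_time_s(packets_lost=50_000, offered_rate_pps=1_000_000)
    print(f"estimated route convergence: {t * 1000:.1f} ms")
```

The estimate is only as good as its assumptions: loss from congestion or other instability during the measurement would inflate the result.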

Convergence Test Issues

Large number of Protocols in which Convergence is important.
Number of conditions that can impact results.
Technical difficulty in testing convergence of one protocol due to flap or instability of another protocol.

Convergence Test Conditions

Interface shutdown
  on Local Interface
  on Remote Interface
Fiber Pull
  on Local Interface
  on Remote Interface
Peer removal via CLI
  on Local router
  on Peer router
Peer node failure
Route Table changes
  Route Withdrawal
  Route Flap
  Next-Hop Change
  Metric Change
  Dynamic Constraint Change
  Policy Change

All conditions must be tested because different results can be produced.
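Because every condition has to be exercised, it can help to generate the test matrix rather than list cases by hand. The sketch below enumerates the conditions from the slide against a protocol list. The protocol subset and all identifiers are hypothetical and stand in for whatever the deployment actually runs.

```python
from itertools import product

# Conditions from the slide, as (event, location) pairs; None means no location.
CONDITIONS = [
    ("Interface shutdown", "Local Interface"),
    ("Interface shutdown", "Remote Interface"),
    ("Fiber Pull", "Local Interface"),
    ("Fiber Pull", "Remote Interface"),
    ("Peer removal via CLI", "Local router"),
    ("Peer removal via CLI", "Peer router"),
    ("Peer node failure", None),
    ("Route Withdrawal", None),
    ("Route Flap", None),
    ("Next-Hop Change", None),
    ("Metric Change", None),
    ("Dynamic Constraint Change", None),
    ("Policy Change", None),
]

PROTOCOLS = ["BGP", "OSPF", "ISIS"]  # hypothetical subset for illustration


def build_test_matrix(protocols, conditions):
    """Return one (protocol, event, location) case per combination."""
    return [(p, event, where) for p, (event, where) in product(protocols, conditions)]


if __name__ == "__main__":
    matrix = build_test_matrix(PROTOCOLS, CONDITIONS)
    print(f"{len(CONDITIONS)} conditions x {len(PROTOCOLS)} protocols = {len(matrix)} cases")
```

Not every condition applies to every protocol (a Dynamic Constraint Change, for example, is specific to traffic engineering), so a real matrix would filter out combinations that do not apply.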

Network-Specific Topology Testing

Large network with many routers (e.g. 10)
Use multiple vendors for interoperability/functionality testing.
Multiple protocols configured in deployment scenario
Run test cases to match deployment scenario

Automated Regression Testing

The addition of bug fixes and new features puts previously working features at risk.
Regression testing ensures that the previously working features still work.
As the number of releases with new features grows, it becomes more difficult to provide complete regression coverage through manual testing (increasingly labor intensive).
Automated regression testing enables more coverage in less time.
Automation is typically achieved using TCL scripts.
Configuration: Router Under Test and Test Equipment (diagram).
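The shape of an automated regression run can be sketched as a small harness that executes every test case and records a result even when a case crashes. The slides say TCL is the typical choice; Python is used here purely for illustration. The harness, the test-case names, and their placeholder bodies are all hypothetical. A real case would drive the test equipment and query the Router Under Test.

```python
from typing import Callable, Dict


def run_regression(test_cases: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every case and record PASS, FAIL, or ERROR for each one."""
    results: Dict[str, str] = {}
    for name, case in test_cases.items():
        try:
            results[name] = "PASS" if case() else "FAIL"
        except Exception as exc:  # a crashing case must not stop the run
            results[name] = f"ERROR: {exc}"
    return results


def all_passed(results: Dict[str, str]) -> bool:
    """True only if every recorded result is PASS."""
    return all(status == "PASS" for status in results.values())


# Placeholder cases: stand-ins for scripts that would talk to real equipment.
def bgp_session_establishes() -> bool:
    return True


def ospf_adjacency_forms() -> bool:
    return True


if __name__ == "__main__":
    results = run_regression({
        "bgp_session_establishes": bgp_session_establishes,
        "ospf_adjacency_forms": ospf_adjacency_forms,
    })
    for name, status in results.items():
        print(f"{name}: {status}")
    print("regression clean" if all_passed(results) else "regression FAILED")
```

Rerunning the same suite against each new release is what lets coverage grow without the manual effort growing with it.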

Commercial Test Equipment Requirements

The State of the Union

Test Equipment fails to meet today’s requirements for testing 99.999% Availability.
Router vendors have been forced to develop their own specialized test tools.
Carriers have been forced to use the router vendor test tools.
Test Equipment vendors must respond to the challenge today.

Stress Testing Requirements

Maintain BGP Sessions and IGP Adjacencies
Flap BGP Routes
Signal and maintain RSVP-TE tunnels
Distribute LDP FECs
Signal and maintain Multicast Groups
Perform SNMP GETs and check validity
Forward Traffic (IP Unicast, IP Multicast, and MPLS)

Make the network seem much bigger than it really is without having to obtain hundreds of routers.

Required Protocol Emulation/Conformance Suites Coverage

Routing Protocols
  BGP
  OSPF, ISIS
  OSPF-TE, ISIS-TE
RSVP-TE
  Fast Reroute
  Standby Tunnels
  Ingress, Mid-Point, Egress
LDP
  RFC 2547 Layer 3 VPNs
  Martini Layer 2 VPNs
  P and PE
  LDP over RSVP
Multicast
  MBGP
  PIM-SM
  MSDP

Protocol Emulation Requirements

Run any protocols in combination on the same interface
Forward traffic for emulated protocols
Protocol Emulation on any interface type – GigE, 10GigE, and POS (including 192c)
Scaling
  BGP Sessions >500/system, >100/interface
  BGP Routes >3M/system, >500K/session
  MPLS-TE Tunnels >10K - Ingress, Mid-Point, Egress
  FECs >10K
Load external BGP table for advertisement
Controlled BGP Route Flapping
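The scaling thresholds above lend themselves to a simple checklist when evaluating a tester. The sketch below compares a tester's stated capabilities against the minimums from the slide. The dictionary keys, the function name, and the example capability figures are hypothetical.

```python
# Minimums from the slide; each must be strictly exceeded (">").
# Key names are assumptions made for this sketch.
SCALING_THRESHOLDS = {
    "bgp_sessions_per_system": 500,
    "bgp_sessions_per_interface": 100,
    "bgp_routes_per_system": 3_000_000,
    "bgp_routes_per_session": 500_000,
    "mpls_te_tunnels": 10_000,
    "fecs": 10_000,
}


def scaling_shortfalls(capabilities: dict) -> list:
    """Return the requirement keys that a tester fails to exceed."""
    return [key for key, minimum in SCALING_THRESHOLDS.items()
            if capabilities.get(key, 0) <= minimum]


if __name__ == "__main__":
    # Hypothetical tester data sheet, for illustration only.
    tester = {
        "bgp_sessions_per_system": 1000,
        "bgp_sessions_per_interface": 100,  # meets but does not exceed
        "bgp_routes_per_system": 4_000_000,
        "bgp_routes_per_session": 600_000,
        "mpls_te_tunnels": 8_000,
        "fecs": 20_000,
    }
    print("shortfalls:", scaling_shortfalls(tester))
```

A missing capability is treated as zero, so an incomplete data sheet shows up as a shortfall rather than passing silently.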

Automated Regression Requirements

Commercial test equipment vendors offer protocol conformance TCL suites.
  Test Case coverage must be improved within each suite
  Interaction between protocols must be tested
  Need each script to test multiple interfaces (4 or more)
Full Protocol Coverage
  Multicast protocols have been the “forgotten son”

System Requirements

Multiple ports per chassis (>32)
Automated Convergence measurement
Automated reroute/failover measurement
Support for ECMP and Composite Links
System/Protocol Stability For Many Days
Ability to store GUI configuration for repeatability
Ability to TCL script any GUI test case