Multi-Layer Survivability

Johan Meijen, Eve Varma, Ren Wu and Yufei Wang

WHITE PAPER
OPTICAL NETWORKING GROUP


WP008.qxd 11/29/1999 09:49 Page 1

INTRODUCTION

Networks are increasingly migrating toward the use of multiple technologies integrated into layers, with client-server relationships existing between layers. The technologies most widely used in such layers are DWDM/Optical Transport Networking (OTN), SDH/SONET, ATM, and IP. A wide range of layerings is possible, involving different client/server relationships, some of which are illustrated in Figure 1. Depending on the choices made by particular operators, a given network might employ all, or a subset, of these layerings.

Survivability refers to the ability of a network to maintain an acceptable level of service during a network or equipment failure. Multi-layer survivability refers to the possible nesting of survivability schemes among subtending network layers, and to the way in which these schemes interact with each other. This paper recommends particular survivability strategies for multi-layer networks. It does not focus on multi-layer survivability involving innovative, evolving mechanisms for IP networks.

BACKGROUND

In order to address multi-layer survivability issues, it is essential first to understand fundamental survivability-related considerations and characteristics. Within this section, we briefly review network operator considerations, deployment options, and the essential aspects of the survivability architectures available at the various network layers. We also discuss several key factors influencing the choice of any particular survivability architecture.

OPERATOR CONSIDERATIONS

In deploying survivability strategies, operators need to consider the following issues:

• Network size and projected growth

• The portfolio of service offerings, as well as the Quality of Service (QoS) committed in Service Level Agreements (SLAs)

• The policies regarding such matters as usage of performance-management counters (for unavailable-second counts, protection-switch counts, etc.), and their importance

• Whether the service offered traverses multiple operator domains

Deployment of survivability schemes involves certain cost/benefit trade-offs. These trade-offs are fairly easy to assess for single-layer networks. For multi-layer networks, however, they are more difficult to assess, because multiple survivability schemes are deployed at the different network layers. In addition, the presence of multiple survivability mechanisms in a network introduces a number of technical and operational issues.

These issues involve:

• Interaction among single-layer survivability schemes

• Impact on service availability and performance

• Degree of standardization of survivability schemes, and the number of potential options

• Leveraging the different layers to their maximum advantage

From an operator perspective, it is critical to arrive at a coherent multi-layer survivability strategy that enables the desired level of QoS and network-bandwidth optimization, and that minimizes cost on a per-service basis. In the section titled "Multi-Layer Survivability," we specifically analyze the issues and benefits of multi-layer survivability, and identify the functionality required to support a coherent strategy.

LUCENT TECHNOLOGIES OPTICAL NETWORKING

Figure 1: Illustration of Possible Layer Nesting

DEPLOYMENT OPTIONS

A transport network can be characterized in terms of the applications supported by its constituent sub-networks.

We generally distinguish three tiers of transport networks:

• Tier 1 - core network: The core or backbone network is used for transporting high-capacity traffic; it often consists of Add-Drop Multiplexers forming physical rings, or Cross-Connects with physical point-to-point links

• Tier 2 - regional network: The regional network is typically used for transporting traffic in large urban or metropolitan areas; it often consists of Add-Drop Multiplexers forming physical rings

• Tier 3 - local/access network: The local/access network is used for collecting traffic and transporting low-capacity traffic; it often consists of physical rings and/or physical point-to-point links

Not all networks have three tiers. For example, a small operator may restrict its network to two tiers, whereas a very large network could have additional tiers.

Survivability can be deployed on an end-to-end basis or in a cascaded and/or nested manner, or it can involve a combination of these approaches (see Figure 2).

End-to-end survivability protects the complete end-to-end transport connection by means of a single mechanism.

Cascaded survivability is often used within transport networks where the survivability mechanisms are deployed in a chain. In this case, each sub-domain in the network supports a particular survivability mechanism that protects only against faults occurring within that sub-domain.

Nested survivability involves using multiple survivability mechanisms within a single sub-domain, where the fault coverage may differ for each mechanism. In general, the nesting is restricted to two mechanisms, which are designated the primary and secondary levels of defense. For example:

• Consider a network that has cascaded survivability mechanisms as a primary level of defense, and that supports multiple services. When the availability of this network is not sufficient for a specific type of service, an additional end-to-end survivability mechanism can be deployed as a secondary level of defense. This secondary level of defense increases network availability and supports the required level of QoS (for example, for an ATM service that is not consistently carried on SONET/SDH transport networks)

• A protection mechanism can be selected on the basis of its restoration time as a first level of defense against common faults such as fiber cuts. However, it may be unable to protect against node failures or multiple faults. Restoration is often regarded as a second level of defense against such failure modes

• Two nested survivability mechanisms can be used where the primary mechanism is intended, for example, to allow operator maintenance at the section level, and the secondary mechanism is used for end-to-end service protection

Figure 2: End-to-end, Cascaded, and Nested Survivability
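The first nested-survivability example above can be made concrete with a back-of-the-envelope availability calculation. The sketch below is not the authors' model. It assumes independent failures, a secondary end-to-end route fully diverse from the primary cascaded path, and instantaneous switching. The availability figures and function names are hypothetical, chosen only for illustration.

```python
from math import prod

def cascaded_availability(domain_availabilities):
    """Cascaded survivability: each sub-domain protects only its own faults,
    so the end-to-end path is up only if every sub-domain is up."""
    return prod(domain_availabilities)

def nested_availability(primary, secondary):
    """Nested secondary level of defense on an independent route: the
    service is lost only if both levels of defense are down at once."""
    return 1 - (1 - primary) * (1 - secondary)

# Hypothetical per-sub-domain availabilities (access, regional, core).
primary = cascaded_availability([0.9999, 0.9995, 0.9999])
with_secondary = nested_availability(primary, 0.999)

print(f"cascaded only:             {primary:.7f}")
print(f"with end-to-end secondary: {with_secondary:.9f}")
```

Under these assumptions, chaining sub-domains multiplies their availabilities, so the cascaded path is never more available than its weakest sub-domain. The nested end-to-end mechanism then removes most of the remaining unavailability. Correlated failures, such as two routes sharing a duct, would erode that gain.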

As indicated earlier, a survivability mechanism may be needed to protect against both single and multiple faults. A particularly serious type of single fault is a node failure that impacts all the subtending sub-networks. To protect against node faults, protection mechanisms need to interwork using Dual Node Interconnect architectures, as illustrated in Figure 3.

However, not all protection mechanisms support cascaded Dual Node Interconnection. In particular, linear protection schemes supporting extra traffic do not support Dual Node Interconnect architectures.

SURVIVABILITY ARCHITECTURES

This section introduces various survivability architectures (independent of any multi-layer environment considerations), and examines a set of generic approaches that can be implemented to accommodate any particular technology. Included are survivability architectures that can be used within SONET/SDH, ATM, and IP, and potentially for the OTN. These architectures are compared on the basis of the following criteria: restoration speed, capacity overbuild, flexibility/selectivity, and standardization.

The protection architectures in this paper are described using ITU-T terminology.

Overview

From a traditional transport-networking point of view, survivability is commonly classified in terms of protection and restoration. Historically, the distinction between protection and restoration has been based on factors such as topology, recovery speed, and degree of deterministic behavior. For example, protection has been considered a topology-specific technique (linear or ring) that offers fast recovery, for the following reasons:

• The protection mechanism is implemented in a distributed manner within the network elements

• Defects¹ are used as triggers, resulting in a fast detection time (for example, physical-media faults can be detected within 10 milliseconds)

• The amount of capacity dedicated to protection purposes is known before the fault occurs, resulting in a quick transfer of traffic from faulty to good facilities. The SDH/SONET requirement for this transfer is less than 50 milliseconds.

Figure 3: Dual Node Interconnect

¹ Definition according to ITU-T Recommendation M.20.

The traditional restoration approach, as employed within the telecommunications domain, has typically been used for mesh-based topologies. This approach offered the advantage of optimized capacity utilization (shared spare resources), but the restoration time was much slower than that for protection.

The reasons for this are:

• The restoration mechanism was based on a centralized approach implemented within an external management system

• Alarms were used as triggers, causing a detection time on the order of seconds

• The management system had to re-provision the cross-connects, resulting in a transfer time on the order of seconds or even minutes, depending on the order in which the cross-connects were restored

However, the traditional distinctions between restoration and protection are blurring, as shown by the following trends:

• Shared Protection Ring (SPRing) mechanisms have been used for several years. For example, Multiplex Section Shared Protection Ring (MS-SPRing) architectures are now widely deployed in SDH networks. The current SPRing mechanisms are restricted to use on the Multiplex Section layer, which is tightly coupled with the physical media. However, the concept can be applied to layers not related to the physical media. For example, the SPRing mechanism can be implemented on a per-OTN-optical-channel basis, allowing it to be deployed in meshed topologies as well as in typical rings

• Traditional mesh-based restoration mechanisms may evolve. Faster restoration times could be achieved, for example, by means of distributed pre-computation of alternative routes prior to failures, use of defects as a trigger for survivability actions, and distributed activation

Protection

Protection mechanisms can be dedicated or shared. Dedicated protection requires 100% protection-capacity overbuild. The most commonly deployed dedicated protection mechanism is the 1+1 architecture, which does not support extra traffic, as opposed to the 1:1 architecture, which does support extra traffic.

As mentioned earlier, a commonly deployed shared-protection mechanism is based on the ring topology and is known as the Shared Protection Ring (SPRing). The efficiency of an SPRing architecture depends on the traffic pattern.
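The practical difference between the 1+1 and 1:1 architectures lies in how the normal signal reaches the protection channel. The following sketch is a hypothetical toy model; the class and method names are illustrative and do not come from any standard or product. In 1+1, the signal is permanently bridged onto both channels and the receiver simply selects one. In 1:1, the protection channel may carry extra traffic, which is pre-empted when a switch occurs.

```python
from typing import Optional

class LinearProtectionGroup:
    """Toy model of one working and one protection channel (hypothetical)."""

    def __init__(self, arch: str, extra_traffic: Optional[str] = None):
        if arch not in ("1+1", "1:1"):
            raise ValueError("arch must be '1+1' or '1:1'")
        if arch == "1+1" and extra_traffic is not None:
            # 1+1 permanently bridges the normal signal onto the protection
            # channel, so no protection capacity is left for extra traffic.
            raise ValueError("1+1 cannot carry extra traffic")
        self.arch = arch
        self.extra_traffic = extra_traffic
        self.on_protection = False

    def protection_payload(self, normal: str) -> Optional[str]:
        """Return what the protection channel currently carries."""
        if self.arch == "1+1" or self.on_protection:
            return normal
        return self.extra_traffic  # extra traffic, or None if idle

    def working_failed(self) -> Optional[str]:
        """React to signal fail on the working channel.

        Returns the extra traffic that was pre-empted, if any.
        """
        preempted = None
        if self.arch == "1:1":
            # 1:1 must first bridge the normal signal onto protection,
            # dropping whatever extra traffic was using that capacity.
            preempted, self.extra_traffic = self.extra_traffic, None
        # For 1+1 only the tail-end selector changes.
        self.on_protection = True
        return preempted

group = LinearProtectionGroup("1:1", extra_traffic="best-effort")
print(group.protection_payload("premium"))
print(group.working_failed())
print(group.protection_payload("premium"))
```

Pre-emption is the price of extra traffic. It allows 1:1 to use the protection capacity during fault-free operation, and it is also why the extra traffic itself is unprotected.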

Restoration

Several key aspects may be used to differentiate restoration mechanisms. When the highest priorities are speed and scalability, the critical aspect is the degree of centralization versus distribution. As mentioned earlier, distributed restoration mechanisms may restore service much more quickly than the traditional restoration schemes employed in the telecommunications environment.


Survivability Mechanisms

Table 1 summarizes important survivability mechanisms used within SONET/SDH, ATM, and traditional IP, and potentially for the OTN.

Table 1: Survivability Mechanism by Technology

OTN
• Dedicated linear protection: OMSP, OCH-SNCP
• Shared ring protection: OMS-SPRing, OCH-SPRing
• Restoration: distributed
Remarks: OMSP is Optical Multiplex Section Protection. OCH-SNCP is Optical Channel Sub-Network Connection Protection. OMS-SPRing is Optical Multiplex Section Shared Protection Ring. OCH-SPRing is Optical Channel Shared Protection Ring.

SDH/SONET
• Dedicated linear protection: MSP, SNCP
• Shared ring protection: MS-SPRing
• Restoration: distributed
Remarks: SDH Multiplex Section Protection (MSP) is the same as SONET Line Protection. SDH Sub-Network Connection Protection (SNCP) is the same as SONET Path Protection; when deployed on a physical ring, SONET calls this a Uni-directional Path Switched Ring (UPSR). SDH Multiplex Section Shared Protection Ring (MS-SPRing) is the same as SONET Bi-directional Line Switched Ring (BLSR).

ATM
• Dedicated linear protection: VC, VP, VCG and VPG protection
• Restoration: distributed, PNNI
Remarks: Virtual Channel (VC), Virtual Path (VP), Virtual Channel Group (VCG), and Virtual Path Group (VPG) protection, with or without extra traffic. PNNI (Private Network to Network Interface) is an ATM interface between switches. It is used to distribute information about the state and structure of the network, to establish circuits, to ensure that reasonable bandwidth and quality-of-service contracts can be established, and to provide some network-management functions.

IP
• Restoration: distributed, traditional IP rerouting
Remarks: Includes rerouting algorithms such as Open Shortest Path First (OSPF), Routing Information Protocol (RIP), and Border Gateway Protocol (BGP).


The following ratings apply for each of the attributes:

Restoration speed for physical-media faults such as fiber cuts:

* Minutes.

** Seconds.

*** Hundreds of milliseconds.

**** Less than a hundred milliseconds.

Capacity overbuild, considering only the spare capacity for the layer at which the survivability mechanism is exercised:

* 100% capacity overbuild: no capacity available for extra traffic or QoS differentiation.

** 100% capacity overbuild: available for extra traffic or QoS differentiation.

*** Less than 100% capacity overbuild: available for extra traffic or QoS differentiation.

**** Capacity overbuild is minimized: supports extra traffic or QoS differentiation.

Flexibility/selectivity:

* Linear topologies.

** Linear and/or ring topologies.

*** Can be used in meshed topologies (although not optimized).

**** Optimized for meshed topologies.

Standardization:

* Standardization time frame is uncertain, but at least beyond 2000.

** Standardization expected beyond 2000.

*** Standardization expected by 2000.

**** Standardized.


Table 2: Aspects of Survivability Mechanisms

Mechanisms rated: OMSP, OCH-SNCP, OMS-SPRing, OCH-SPRing, OTN distributed restoration, MSP, SNCP, MS-SPRing, SDH/SONET distributed restoration, ATM protection, PNNI, and traditional IP rerouting. Attributes: restoration speed, capacity overbuild, flexibility/selectivity, and standardization.

Table 2 relates the most relevant and promising survivability mechanisms to four key attributes: restoration speed, capacity overbuild, flexibility/selectivity, and standardization. The more asterisks shown for an attribute, the better the performance; the rating legend defines each level.


FACTORS INFLUENCING CHOICE OF SURVIVABILITY ARCHITECTURES

One of the major factors influencing the choice of survivability architecture is the time required to restore service. Other factors include traffic patterns, topology, and types of faults.

ITU-T Recommendation M.495, Figure 28, defines a protection-switching time model that includes the following important parameters:

• Detection time - the time between the occurrence of a network impairment and the detection of a signal fail (SF) or signal degrade (SD) triggered by that impairment

• Transfer time - the time interval from the confirmation that an SF or SD requires protection-switching operations until the completion of those operations

The sum of the detection and transfer times is called the restoration time.
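The detection/transfer model above can be combined with the restoration-speed legend used for Table 2. In the sketch below, the band edges between "hundreds of milliseconds", "seconds" and "minutes" (1 s and 60 s) and the timing values are assumptions for illustration. Only the 10 ms detection figure for physical-media faults and the 50 ms SDH/SONET transfer requirement come from this paper.

```python
# Band edges for the restoration-speed legend (assumed cut-offs, seconds).
SPEED_BANDS = [
    (0.1, "****"),  # less than a hundred milliseconds
    (1.0, "***"),   # hundreds of milliseconds
    (60.0, "**"),   # seconds
]
SDH_SONET_TRANSFER_LIMIT_S = 0.050  # "less than 50 milliseconds"

def restoration_time(detection_s: float, transfer_s: float) -> float:
    """Restoration time = detection time + transfer time."""
    if detection_s < 0 or transfer_s < 0:
        raise ValueError("times must be non-negative")
    return detection_s + transfer_s

def speed_rating(restoration_s: float) -> str:
    """Map a restoration time onto the asterisk legend."""
    for upper_bound, stars in SPEED_BANDS:
        if restoration_s < upper_bound:
            return stars
    return "*"  # minutes

def meets_transfer_limit(transfer_s: float) -> bool:
    """Check a transfer time against the SDH/SONET requirement."""
    return transfer_s < SDH_SONET_TRANSFER_LIMIT_S

# Protection: defect-triggered detection, pre-assigned protection capacity.
protection = restoration_time(detection_s=0.010, transfer_s=0.035)
# Traditional centralized restoration: alarm-triggered detection, then a
# management system re-provisions the cross-connects (illustrative values).
centralized = restoration_time(detection_s=2.0, transfer_s=90.0)

print(f"protection:  {protection * 1000:.0f} ms -> {speed_rating(protection)}")
print(f"centralized: {centralized:.0f} s -> {speed_rating(centralized)}")
```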

For shared protection rings, the traffic pattern may influence the efficiency of the protection mechanism. Traffic patterns can be hubbed (single or double hub), uniform, or adjacent.

In local/access sub-networks, the pattern is frequently hubbed. In regional sub-networks, it is more uniform and somewhat adjacent. In the core sub-network, the pattern is a mixture of uniform and adjacent.

The network structure and the traffic patterns influence the efficiency of the survivability mechanisms:

• Meshed sub-network structure: In (physical) meshed sub-networks consisting of SDH/SONET cross-connect systems and point-to-point interconnecting links, dedicated SNCP protection schemes can be used for survivability with fast restoration times. In addition, or as an alternative, restoration can offer a slower but much more capacity-efficient survivability mechanism. Shared protection rings can also be part of a meshed sub-network structure (for example, to interconnect larger stand-alone cross-connect systems) and may be used to build fast and efficient solutions for certain applications. The current shared protection mechanisms are restricted to use on the Multiplex Section layer, which is tightly coupled with the physical media. However, the concept can be applied to layers not related to the physical media. Particularly for new technologies such as OTN, shared protection rings can be defined not only at the Optical Multiplex Section layer but also at the Optical Channel layer, which offers combined benefits in terms of flexibility, efficiency, and restoration time

• Ring sub-network structure: In (physical) ring sub-networks consisting of SDH/SONET add-drop multiplexers, restoration has been seen as offering less benefit than protection. The alternative path is already known, and traditional restoration time frames are too slow. The traffic pattern also has to be considered in order to determine whether shared protection offers added value. For example, for uniform and site-to-site demands, shared protection rings offer an optimal solution from a capacity-efficiency perspective; for hubbed demands, however, dedicated protection is often a good alternative
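The effect of traffic pattern on ring efficiency can be checked with a small capacity calculation. The sketch below is a simplified model, not a planning tool. It makes the following assumptions:

- a five-node ring carrying unit bidirectional demands, with no demand splitting;
- a path-switched ring (SNCP/UPSR-style dedicated protection) in which every demand occupies every span;
- a two-fiber shared protection ring (MS-SPRing/BLSR-style) in which working traffic takes the shorter arc and half the line capacity is reserved for protection.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Demands = Dict[Tuple[int, int], int]

def shortest_arc(a: int, b: int, n: int) -> List[int]:
    """Spans on the shorter way round an n-node ring (span i joins nodes
    i and i+1). Ties are broken clockwise."""
    clockwise = [(a + k) % n for k in range((b - a) % n)]
    counter = [(b + k) % n for k in range((a - b) % n)]
    return clockwise if len(clockwise) <= len(counter) else counter

def dedicated_ring_capacity(demands: Demands) -> int:
    """Path-switched ring: each demand is sent both ways round, so it
    occupies capacity on every span of the ring."""
    return sum(demands.values())

def shared_ring_capacity(demands: Demands, n: int) -> int:
    """Two-fiber shared protection ring: working traffic takes the shorter
    arc, and an equal amount of line capacity is reserved as shared
    protection, so the line rate is twice the worst span's working load."""
    load = [0] * n
    for (a, b), size in demands.items():
        for span in shortest_arc(a, b, n):
            load[span] += size
    return 2 * max(load)

N = 5
patterns = {
    "hubbed": {(0, k): 1 for k in range(1, N)},
    "uniform": {pair: 1 for pair in combinations(range(N), 2)},
    "adjacent": {(i, (i + 1) % N): 1 for i in range(N)},
}
for name, demands in patterns.items():
    print(f"{name:8s} dedicated={dedicated_ring_capacity(demands)} "
          f"shared={shared_ring_capacity(demands, N)}")
```

Under these assumptions, the hubbed pattern needs the same line capacity either way (4 units), so the simpler dedicated scheme loses nothing. The uniform pattern (10 versus 6) and the adjacent pattern (5 versus 2) both favor the shared ring, in line with the observations above.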

As indicated earlier, any particular survivability mechanism might fail to protect against all possible fault causes. Two questions then arise: what kinds of faults exist, and which faults are recovered by which mechanism? The first question is easier to answer than the second.

The following main categories of faults are recognized:

• Physical-medium faults, including fiber cuts

• Hardware faults, including node faults, optical-component faults, and any fault causing a connection fault

• Performance degradations, which can be bursts of errors or Poisson-distributed series of errors

• Provisioning errors, including open connections, mis-routing, and wrong signal-characteristic information

Figure 4: Traffic Patterns

Establishing a general principle for determining which faults are recovered by a particular survivability mechanism is much more difficult. The answer depends not only on the particular mechanism but also on the following:

• Type of transport entity: For example, an SDH VC-4 protection mechanism cannot detect a VC-12 provisioning error, but a VC-12 protection mechanism can detect a VC-4 provisioning error

• Monitoring type: For sub-network connections, different monitoring types exist. For example, one monitoring type might protect against a connectivity fault resulting from a provisioning error, whereas another monitoring type might not

• Protection domain: For example, a failed intermediate node lies within the protection domain, whereas a failed end node lies outside it and cannot be protected by that mechanism
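The VC-4/VC-12 example can be illustrated by modeling what each mechanism actually inspects. The sketch assumes trail-trace-based monitoring: the SDH higher-order path trace is carried in the J1 byte and the VC-12 path trace in the J2 byte. Whether a given mechanism uses such monitoring is exactly the "monitoring type" dependency listed above. All class, function, and identifier names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PathTraces:
    """Trail-trace identifiers seen at the sink (hypothetical model).

    A VC-12 travels inside a VC-4, so a VC-12 trace that arrives is the one
    belonging to whichever VC-4 was actually connected.
    """
    vc4: str   # higher-order path trace (J1)
    vc12: str  # lower-order path trace (J2)

def vc4_protection_detects(expected: PathTraces, received: PathTraces) -> bool:
    # A VC-4 mechanism looks only at VC-4 overhead.
    return received.vc4 != expected.vc4

def vc12_protection_detects(expected: PathTraces, received: PathTraces) -> bool:
    # A VC-12 mechanism looks at VC-12 overhead, which is carried end to end
    # inside whichever VC-4 happens to transport it.
    return received.vc12 != expected.vc12

expected = PathTraces(vc4="VC4-A", vc12="VC12-x")
# VC-4 provisioning error: the wrong VC-4, and hence the wrong VC-12, arrives.
vc4_error = PathTraces(vc4="VC4-B", vc12="VC12-y")
# VC-12 provisioning error: the right VC-4, but the wrong VC-12 inside it.
vc12_error = PathTraces(vc4="VC4-A", vc12="VC12-z")

print("VC-4 mechanism sees VC-12 error: ", vc4_protection_detects(expected, vc12_error))
print("VC-12 mechanism sees VC-4 error: ", vc12_protection_detects(expected, vc4_error))
```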

Within the context of multi-layer survivability, the most important parameter to focus on is the fault type and the effect this fault has on traffic. Faults such as physical-medium faults, node faults, and some hardware faults affect all services in all network layers, and consequently have to be recovered from concurrently (and quickly). The effects of other types of hardware faults, provisioning errors, and performance degradations are often less catastrophic, because fewer services are affected, or not all services are affected at the same time.

MULTI-LAYER SURVIVABILITY

The introduction of different networking technologies (OTN, SONET/SDH, ATM, and IP), coupled with the wide variety of network applications, each having its own QoS requirements and implementation approach, has led to multi-technology, multi-layer transport networks. The actual nesting of these layers in a particular operator's network may vary, depending on technology deployment and network evolution. Additionally, as we have seen, each layer may support a number of survivability mechanisms. Figure 5 provides an overview of possible ways of nesting layers and survivability mechanisms.

Figure 5: Overview of Survivability Mechanisms

The major challenge in fulfilling different survivability requirements in a multi-layer network configuration is developing a set of suitable options for providing various degrees of reliability. A good survivability strategy should leverage the strengths of the different technologies and their associated survivability schemes. The objective of a multi-layer survivability strategy is to deliver the desired QoS in a better and more cost-effective way than could be achieved with a single-layer approach. The multi-layer survivability strategy should therefore be based on the following criteria:

• Performance - the restoration time should be sufficiently fast to support the desired levels of QoS

• Efficiency - the amount of spare capacity required for network survivability should be minimized, because spare capacity is strongly related to cost

• Maintainability - the survivability strategy should support network maintenance operations. For example, a well-planned survivability strategy ensures continuity of service in the event of failures, enabling operators to follow their desired practices regarding repair procedures (for example, carrying out repairs during "normal working hours")

• Evolvability - the introduction of new network layers should not be hindered by survivability considerations, and should not have a detrimental effect on current services. Specifically, deployment of a new service, or of a new survivability mechanism, should not affect existing services and survivability schemes

• Flexibility - the strategy should not restrict an operator to a single solution, but instead should offer a coherent set of survivability solutions that can be tailored to individual operator requirements

• Cost - while difficult to quantify, the survivability strategy should balance equipment and operations costs

SURVIVABILITY OPTIONS

There are two survivability options: single-layer recovery and multi-layer recovery [1]. Single-layer recovery employs a single end-to-end or cascaded survivability strategy within one layer.

Multi-layer recovery uses two or more nested survivability mechanisms.

Before identifying particular recovery mechanisms, we introduce several concepts related to multi-layer networks that offer a logical framework for the discussions that follow. We begin by classifying a multi-layer network into a logical architecture consisting of three separate layers: physical media, transport, and customer service²:

• The physical media layer includes the fibers and everything underneath them (for example, the duct in which the fibers are placed, the roads along which the duct runs, etc.). A fault in this layer often affects all the services that are transported by means of this layer. There is no automatic survivability in the physical layer

• The transport layer(s) act as a server for customer services, offering transmission bandwidth and management functionality, and decoupling the service layer from the physical layer. While it can provide survivability mechanisms for transport failures, the transport layer cannot protect against service-layer failures

• The customer service layer represents the service offered by an operator to its customers. Service failures can be recovered by means of survivability mechanisms at this layer. Services at this level might include premium SONET/SDH services, leased wavelengths or lines, etc.

A multi-layer network can support a variety of applications. Whether a network technology is categorized as a transport or customer-service layer in a multi-layer environment depends on what an operator has chosen to offer as a service, and on which technologies the operator has chosen to deploy in its network. Depending on the application, a single layer can play the role of a customer-service layer or the role of a transport layer. For example, if an operator offers "Voice over IP" (carried over an ATM, SDH/SONET, and/or OTN network), IP is considered to be at the customer-service layer, and the other layers are considered to be transport layer(s). On the other hand, if the operator is offering an SDH/SONET "leased line" service (transported over the OTN), the SDH/SONET layer acts as the customer-service layer to the OTN transport layer.
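The role assignment described above depends only on where the offered service sits in the client/server stack. Here is a minimal sketch. It assumes the stack is listed from top (client) to bottom (server) and ignores the physical media layer, which has no automatic survivability. The function name and layer labels are illustrative.

```python
from typing import List, Tuple

def layer_roles(stack: List[str], service_layer: str) -> Tuple[str, List[str]]:
    """Return (customer-service layer, transport layers) for an offering.

    Every layer below the offered service acts as transport for it; layers
    above it play no part in carrying this particular service.
    """
    if service_layer not in stack:
        raise ValueError(f"{service_layer!r} is not in the stack")
    i = stack.index(service_layer)
    return service_layer, stack[i + 1:]

stack = ["IP", "ATM", "SDH/SONET", "OTN"]
print(layer_roles(stack, "IP"))         # Voice over IP offering
print(layer_roles(stack, "SDH/SONET"))  # SDH/SONET leased-line offering
```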

² Our usage of the term layer here is less rigorous than the definition in ITU-T Recommendation G.805 functional modeling.

Single-layer recovery can be performed in a transport layer as well as in the customer-service layer. The trade-offs to be considered depend on what the service and transport layer(s) are for the particular application under consideration. Consider a scenario with traditional IP as the service and SDH/SONET as the transport layer. When a physical-medium fault occurs, transport-layer recovery is faster than service-layer recovery. The transport layers restore the traffic in larger-granularity bundles. This makes the recovery approach more effective, especially for catastrophic faults such as fiber cuts, and makes network maintenance easier to handle. However, the transport-recovery scheme cannot detect service-layer equipment faults, and therefore cannot provide a recovery mechanism for them. Service-layer recovery, on the other hand, is able to restore all faults, including physical, transport, and service-layer faults. It is too slow, however, to recover efficiently from failures at the physical-media layer.

Multi-layer recovery can effectively combine the merits of transport-layer and service-layer recovery schemes. However, without an appropriate strategy, a single fault can trigger multiple recovery mechanisms. These may then interact with each other, leaving the network in an undesirable or unknown state. A good multi-layer strategy identifies where the nesting of survivability mechanisms can be useful, and how undesired interworking in those cases can be avoided.
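One way to see the interaction problem is to model when a service layer would start its own recovery. The sketch below assumes a hold-off rule, a common coordination technique in which the higher layer waits for a fixed time before acting. This paper does not prescribe that rule, and all timings and names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FaultTiming:
    """Illustrative timings for one fault (seconds after the fault)."""
    transport_recovered_at: Optional[float]  # None: transport cannot recover it
    service_detects_at: float                # service layer notices the outage

def service_layer_acts(t: FaultTiming, hold_off_s: float) -> bool:
    """True if the service layer starts its own recovery.

    Hypothetical coordination rule: after detecting the outage, the service
    layer waits hold_off_s and acts only if the transport layer has not
    restored the traffic by then.
    """
    decide_at = t.service_detects_at + hold_off_s
    return t.transport_recovered_at is None or t.transport_recovered_at > decide_at

# Fiber cut: recovered by transport-layer protection at 50 ms.
fiber_cut = FaultTiming(transport_recovered_at=0.05, service_detects_at=0.02)
# Service-layer equipment fault: invisible to the transport layer.
router_fault = FaultTiming(transport_recovered_at=None, service_detects_at=0.02)

for hold_off in (0.0, 0.1):
    print(f"hold-off {hold_off:.1f} s: "
          f"fiber cut -> {service_layer_acts(fiber_cut, hold_off)}, "
          f"router fault -> {service_layer_acts(router_fault, hold_off)}")
```

With no hold-off, the fiber cut triggers a service-layer action even though the transport layer restores the traffic at 50 ms. A 100 ms hold-off suppresses that unnecessary action. The cost is that service-layer recovery is delayed by the same 100 ms for faults, such as a router failure, that the transport layer cannot see.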

MULTI-LAYER SURVIVABILITY ISSUES AND BENEFITS

As mentioned earlier, multi-layer survivability can offer a number of benefits to operators. However, improper deployment of multiple survivability schemes can lead to a number of pitfalls. The two main ones are illustrated in Figure 6. The first pitfall is that a single physical-layer fault (such as a fiber cut) can result in many unnecessary protection actions at the customer-service layer. The second pitfall is the potential for considerable wasted bandwidth, resulting from each layer having its own spare resources to use during faults.

Depending on the survivability mechanism, a single physical fault results in one or many necessary protection actions at the transport layer. The pitfall consists of the many unnecessary protection actions that may occur at the customer-service layer. Major concerns for operators are as follows:

• It may leave the network in an undesired or unknown state, which can greatly extend an end user's measured outage resulting from a single fault, and thereby potentially activate the "minimum availability" penalties of service-level agreements

• It may complicate the operator's maintenance task

• It does not guarantee protection independence. A fault in one sub-network may cause unnecessary protection switching in interconnected sub-networks that carry the end-to-end trail or network connection

Figure 6: Common Multi-Layer Survivability Pitfalls (unnecessary protection actions: multiple service-layer actions from a single physical-layer fault; wasted bandwidth: spare capacity in the transport layers can be up to 75%)

The amount of spare resources has a strong effect on cost. However, we should not assume that multi-layer survivability requires more spare capacity. A number of factors influence the actual amount of spare capacity required:

• To support customer-service-layer survivability, spare capacity has to be allocated not only in the service layer but also in the transport layer(s). In this case, the transport-layer spare capacity provides the alternative routes used by the service-layer survivability mechanism, and may not be used for transport survivability. The total cost of this survivability solution is related to the total amount of spare capacity required by all the layers

• The spare capacity required in the customer-service layer depends on the planned fault coverage. If the intention is to protect against catastrophic faults, the service-layer and transport-layer overbuild will be large. An alternative is to use service-layer recovery only for the less disruptive service faults. The transport-layer overbuild is then used directly for efficient transport-layer recovery from highly disruptive faults. The customer-service overbuild may be smaller, because it only has to recover from the less disruptive service faults; its size depends primarily on the recovery mechanism and the associated protocol type

• The protocol type can be circuit based, as in the optical and SDH/SONET layers, or packet based, as in the ATM and IP layers. Nesting two circuit-based survivability mechanisms duplicates the amount of spare capacity and is unattractive for this reason. However, nesting a packet-based survivability mechanism over a circuit-based survivability mechanism can be attractive
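The spare-capacity penalty of nesting two circuit-based mechanisms is simple arithmetic. The sketch makes two assumptions. First, overbuild is expressed as spare capacity per unit of working capacity. Second, the service layer's spare capacity is itself carried and protected by the transport layer, so the two factors multiply. Function and parameter names are illustrative.

```python
def transport_spare_fraction(service_overbuild: float,
                             transport_overbuild: float) -> float:
    """Share of transport capacity that does not carry working traffic.

    Overbuilds are ratios of spare to working capacity (1.0 = 100%).
    The service layer's spare capacity must itself be carried, and
    protected, by the transport layer, so the two factors multiply.
    """
    if service_overbuild < 0 or transport_overbuild < 0:
        raise ValueError("overbuild cannot be negative")
    total_per_unit_working = (1 + service_overbuild) * (1 + transport_overbuild)
    return 1 - 1 / total_per_unit_working

# Two nested circuit-based mechanisms, each with 100% dedicated overbuild.
print(transport_spare_fraction(1.0, 1.0))
# Transport-layer protection only, no service-layer overbuild.
print(transport_spare_fraction(0.0, 1.0))
```

Two nested 100% overbuilds leave only a quarter of the transport capacity carrying working traffic. This is consistent with the "up to 75%" transport-layer spare capacity noted in Figure 6.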

When properly deployed, multi-layer survivability can result in a number of benefits, the most important of which is reduced overall cost. This is especially true when both a circuit-based and a packet-based recovery mechanism are involved. The main reason is that properly deployed bandwidth management at a circuit layer, in combination with circuit-based recovery, avoids service-layer overbuild. To understand why, we use the example shown in Figure 7 to compare a single packet-based survivability mechanism with a circuit-based transport recovery mechanism. The example consists of an access ring of five nodes, with hubbed traffic demand at the packet-based service layer.

First, we note what happens when a single-layer recovery mechanism is used at the customer-service layer, an approach often referred to as a service layer overbuild architecture³. In this approach, adjacent nodes are connected with dumb pipes. As Figure 8 shows, there is already some service-layer overbuild, due to the "pass-through tax"⁴ in some of the nodes.

Figure 7: Hubbed Traffic Demand in the Service Layer

Figure 8: Service Layer "Pass-Through Tax"

³ There are other issues and limitations with the Service Layer Overbuild Architecture; for example, the end-to-end latency may exceed the maximum allowed delay due to forwarding delays in the routers.

In real networks, the cost of the service-layer "pass-through tax" can be considerable, and the potential bandwidth efficiency of statistical multiplexing may not be enough to offset it. The degree of service-layer overbuild further increases when the architecture needs to support recovery from a physical failure, such as a fiber cut, as illustrated in Figure 9.

Service-layer overbuild can be avoided by properly deploying transport bandwidth management and recovery functions in the transport layer. The service nodes are then connected through a transport networking architecture (see Figure 10).

Because this architecture offers more functionality in the transport layer, the transport-layer cost will consequently increase. However, in real networks, it is the overall balance of network infrastructure and operations cost that has to be considered. There is a common misconception that transport networking always equates to fully dedicated, one-for-one protection bandwidth. This is clearly not the case, especially for the meshed-traffic patterns found in long-haul and core networks, where shared transport-protection architectures (shared protection rings and mesh-based restoration) are ideal. Such architectures minimize protection-capacity overbuild, are on a par with any realistic service-layer scheme, and actually result in lower network cost.

Results of a research project for an SDH/ATM network [2] indicate that it is more cost-effective to deploy SDH and ATM survivability mechanisms in conjunction than to deploy a single ATM protection mechanism. This result also appears to hold for IP/OTN networks, as discussed in a recent Lucent white paper [3], which cites internal Lucent studies of customer networks and traffic forecasts for IP/OTN networks. These studies indicate that, on a first-cost basis, service-overbuild architectures cost at least as much as those built on transport-networking architectures. Further, when lifecycle costs are factored in, especially those driven by the rapid pace of service-layer change-outs, transport networking fares even better.

Other benefits of a properly deployed multi-layer survivability strategy include:

• Fast recovery by means of a transport-layer survivability mechanism, especially for physical-layer faults and some of the larger disturbance-inducing hardware faults

• Service-layer recovery for those faults not covered by the transport-layer survivability mechanism

• A small number of protected transport entities, simplifying maintenance for the network operator

• Peaceful co-existence of individual survivability strategies in a multi-layered network supporting multiple applications


Figure 9: Service Layer Overbuild Architecture (after a fiber cut in the transport layer, the service layer is overbuilt to provide protection capacity)

Figure 10: Transport Networking Architecture (after a fiber cut, the transport layer provides bandwidth management and protection capacity)

4 Without traffic expressing provided by the transport layer, we would need more line interfaces on routers, and higher-throughput routers. All of this translates into an additional cost, called the pass-through tax.


Deployment of multi-layer survivability can make network administration and engineering rules more complex. Even with the single-layer survivability schemes in current networks, operators are encountering deployment difficulties in ensuring that diverse physical routes are used for the working and protection transport entities. Nested survivability will make diverse routing even more difficult. However, in a multi-service network, no single survivability strategy fits all service requirements, operator policies, or technology choices. On a per-service basis, an operator has to consider whether a single survivability mechanism is sufficient or whether multiple survivability mechanisms are required. Thus, a sound multi-layer survivability strategy has to ensure that multiple services can be reliably carried in a single network and meet their QoS requirements, while minimizing overall equipment and operations cost.

MULTI-LAYER SURVIVABILITY PRINCIPLES

When service-reliability requirements cannot be achieved with a single survivability scheme, multiple recovery approaches may need to be employed. Normally, two survivability mechanisms should be sufficient, because the additional cost of extra spare capacity for a third mechanism generally cannot be justified. We recommend using one recovery scheme in the transport layer and one in the customer-service layer.

This section provides the principles governing the deployment of two survivability mechanisms, and explains how to avoid interworking issues between the two mechanisms.

First, we need to look at the possible hierarchy of nested survivability mechanisms (Figure 11). This hierarchy depends heavily on the respective restoration times, in addition to the client/server relationships.

Of course, Figure 11 can only represent a snapshot in time; the actual hierarchy will certainly change as technologies and implementations evolve. For example, ATM protection functions now implemented in software may become much faster when implemented in hardware, and more-advanced restoration mechanisms may be developed. Additionally, major research and implementation efforts are underway to establish alternatives to traditional IP restoration mechanisms. Given these caveats, we can draw the following conclusions on the basis of Figure 11:

• Traditional IP dynamic rerouting and the ATM restoration mechanism are currently too slow to support recovery from physical media/transport-layer faults

• OTN and SDH/SONET restoration, using distributed mesh-based approaches, should be able to provide much faster recovery than is possible with traditional centralized mesh-based approaches

• ATM protection may become sufficiently fast to support recovery from physical media/transport-layer faults

• OTN and SDH/SONET protection, being very fast, can always provide acceptable transport-layer recovery

Figure 11: Possible Hierarchy of Nested Survivability Mechanisms (mechanisms shown: optical protection, optical restoration, SDH/SONET protection, SDH and SONET restoration, ATM protection (VP rings), PNNI ATM restoration, and traditional IP rerouting, placed on a restoration-time axis ranging from up to 50 ms, through 100s of milliseconds and seconds, to minutes)

In general, providing transport-layer recovery as close as possible to the physical-media layer tends to be most efficient, because both the spare capacity across all the affected layers and the number of transport entities involved are minimized. Depending on specific network characteristics and the kinds of services offered by the network operator, what constitutes the lowest transport layer may differ. For example:

• Access: Currently, the lowest transport layer is mainly SDH/SONET; however, in the future, both OTN and ATM layers (ATM for performing bandwidth management) may be deployed

• Core: Again, the lowest transport layer is currently mainly SDH/SONET, with some point-to-point DWDM. From an evolution perspective, the point-to-point DWDM networks should ultimately become OTNs
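The entity-count half of the efficiency argument above can be illustrated by counting how many transport entities a single fiber cut touches at each layer. This sketch uses a hypothetical multiplexing hierarchy (16 optical channels per fiber, one STM-16 per channel, 16 VC-4s per STM-16): recovering at the lowest layer switches one entity, whereas recovering at the VC-4 level switches hundreds.

```python
def entities_to_recover(fan_out):
    """Count the transport entities affected by one fiber cut, layer by layer.

    fan_out lists (layer, client entities per server entity) from the
    physical media upward; the figures used below are hypothetical.
    """
    counts = {}
    entities = 1  # the cut fiber itself
    for layer, per_server in fan_out:
        entities *= per_server
        counts[layer] = entities
    return counts

# Hypothetical fiber: one OMS with 16 optical channels, each carrying
# an STM-16 multiplex section that contains 16 VC-4s.
hierarchy = [("OMS", 1), ("OCh", 16), ("STM-16 MS", 1), ("VC-4", 16)]
print(entities_to_recover(hierarchy))
# {'OMS': 1, 'OCh': 16, 'STM-16 MS': 16, 'VC-4': 256}
```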

Most importantly, it is no longer a given that a single technology can be expected to provide network survivability, as was the case for SDH/SONET. In fact, we expect that different sub-networks may use different technologies for their primary recovery.

The principles avoid, do nothing, hold-off, and cooperate may be used for guidance in examining approaches (as well as their consequences) for the multi-layer interactions that might occur when transport and service-layer recovery mechanisms are nested. Avoiding interworking issues is the preferred approach, if it is possible to do so while still supporting an operator's service portfolio within existing operations and maintenance policies. On the other hand, cooperate is the least-preferred approach, because it needs inter-layer signaling mechanisms and because of the associated complexity and backward-compatibility issues that may arise when already-deployed transport technologies are involved.

In general, activating multiple recovery schemes that act on the same root cause or fault should be avoided. When no conflict in either detection or transfer times is expected for particular nested recovery mechanisms, it may be preferable to allow each layer to remain responsible for its own recovery independently; in other words, do nothing. If avoidance and doing nothing are not possible, and the restoration time-scales in the multi-layer network conflict, a hold-off approach might be an option.

Only as a last resort is it recommended that the different layers cooperate to avoid a conflict by means of an inter-layer signaling strategy for deployed technologies. The major issue with cooperation in current networks is that there are too many potential nesting combinations (each requiring an individual solution), which impact existing standards as well as the large embedded base of deployed equipment. Of course, the situation is totally different if inter-layer signaling takes place within a network element, where vendor-proprietary approaches are appropriate, and where the element is part of a single-vendor sub-network solution.
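The preference order described above (avoid, then do nothing, then hold-off, and cooperate only as a last resort) can be written down as a small decision function. This is a minimal sketch: the parameter names and the single millisecond comparison are hypothetical simplifications, not an interface defined by any standard.

```python
from enum import Enum

class Interworking(Enum):
    AVOID = "avoid"
    DO_NOTHING = "do nothing"
    HOLD_OFF = "hold-off"
    COOPERATE = "cooperate"

def choose_interworking(can_avoid, service_detection_ms,
                        transport_restoration_ms, hold_off_available):
    """Pick a multi-layer interworking approach in order of preference.

    All parameters are hypothetical simplifications:
    can_avoid: only one layer is set up to act on the fault (for example,
        through protection selectivity), so nothing is nested.
    service_detection_ms, transport_restoration_ms: if the service layer
        notices the fault only after the transport layer has restored it,
        the two mechanisms cannot conflict.
    hold_off_available: the service-layer mechanism has a hold-off timer.
    """
    if can_avoid:
        return Interworking.AVOID
    if service_detection_ms > transport_restoration_ms:
        return Interworking.DO_NOTHING
    if hold_off_available:
        return Interworking.HOLD_OFF
    # Last resort: inter-layer signaling between the recovery mechanisms.
    return Interworking.COOPERATE
```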

MULTI-LAYER SURVIVABILITY INTER-LAYER RECOMMENDATION

Protection Selectivity

The most efficient recovery occurs at the layer closest to the physical layer. However, following the first principle, deploying multiple recovery mechanisms that protect against the same (physical) faults should be avoided, which suggests protection selectivity. Protection selectivity at the finest granularity gives the operator the ability to enable or disable recovery mechanisms per individual channel. The main arguments here are efficiency/cost and flexibility. The ability of an operator to decide, on a case-by-case, service-by-service basis, where survivability is needed across a multi-layer network means that, in general:

• The total spare capacity across all the network layers is minimized

• Associated maintenance and operations are optimized, because the number of transport entities is minimized


For example, when the OTN layer is the lowest layer of a multi-layer network offering SDH/SONET as a premium service to customers, protection selectivity at the Optical Channel layer offers the following benefits:

• Survivability at the SDH/SONET layer alone offers recovery from physical-media, OTN, and SDH/SONET equipment faults more efficiently than a multi-layer approach does

• It enables a network operator to leave existing services that use SDH/SONET recovery mechanisms untouched

We note that current deployments of OTN survivability mechanisms are restricted to simple dedicated protection mechanisms. However, more-advanced shared-protection mechanisms are being developed, and will be increasingly attractive for future deployments.

Optical-channel protection selectivity, shown as an example in Figure 12, will be a critical need for as long as SDH/SONET and OTN co-exist in telecommunications networks. We note that per-optical-channel selectivity also applies to OTN optical-channel-layer restoration schemes.
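Per-channel protection selectivity amounts to one enable/disable setting per optical channel. The sketch below assumes a hypothetical data model and a simple default rule: disable OCh protection wherever the client signal (such as an STM ring) already protects itself, so that exactly one layer reacts. In practice the setting would be an operator decision per channel, as described above.

```python
from dataclasses import dataclass

@dataclass
class OpticalChannel:
    """Hypothetical per-channel record used only for this illustration."""
    name: str
    client: str                  # client signal carried, e.g. "STM-16 ring"
    client_protected: bool       # does the client layer run its own recovery?
    och_protection: bool = True  # is optical-channel protection enabled?

def apply_selectivity(channels):
    """Hypothetical default policy: let exactly one layer react per channel."""
    for ch in channels:
        ch.och_protection = not ch.client_protected
    return channels

def protected_at_och(channels):
    """Names of channels whose recovery is left to the optical layer."""
    return [ch.name for ch in channels if ch.och_protection]

channels = apply_selectivity([
    OpticalChannel("OCh-1", "STM-16 ring", client_protected=True),
    OpticalChannel("OCh-2", "unprotected STM-16", client_protected=False),
])
print(protected_at_och(channels))  # ['OCh-2']
```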

Hold-Off Timer

In general, the customer-service layer is not able to distinguish among faults arising in the physical-media, transport, or customer-service layers. A service-layer survivability mechanism might therefore react to physical fault-related triggers even while the transport layer is recovering from them. When the detection time of the customer-service layer is longer than the total restoration time of the transport survivability mechanism, there is no conflict and the do nothing approach applies. This cannot always be guaranteed, and may in fact be the exception rather than the rule. Therefore, initiation of recovery by the service-layer survivability mechanism may be delayed for a pre-defined time, so that the service layer acts only if the transport survivability mechanism has not recovered from the fault. This approach is called hold-off, and is used within SDH.

For example, SDH provides a provisionable hold-off timer for lower- and higher-order VC SNCP, which may be used for premium services in order to allow the MSP or MS-SPRing protection mechanism to react first. Among the survivability mechanisms that can act as a service-layer recovery scheme in conjunction with a transport survivability mechanism, provisionable hold-off timers may become important for IP recovery mechanisms as well as for ATM protection.
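The hold-off rule can be sketched as a comparison of times measured from the moment of the fault. The function and parameter names are hypothetical; the 50 ms transport restoration time in the usage lines reflects the usual SDH/SONET protection-switching target, and the other values are invented for illustration. Ties are assumed to go to the transport layer.

```python
def layers_that_switch(transport_restoration_ms, detection_ms, hold_off_ms=0):
    """Which layers switch after a fault, given a service-layer hold-off timer.

    Hypothetical parameters:
    transport_restoration_ms: time the transport mechanism needs to restore
        traffic, or None if the fault is outside transport-layer coverage
        (for example, a service-node failure).
    detection_ms: time for the service layer to detect the fault.
    hold_off_ms: provisioned hold-off delay.
    Assumption: on a tie, the transport layer is considered to have won.
    """
    switching = []
    if transport_restoration_ms is not None:
        switching.append("transport")
    decision_ms = detection_ms + hold_off_ms
    # When the hold-off expires, the service layer acts only if the fault persists.
    if transport_restoration_ms is None or decision_ms < transport_restoration_ms:
        switching.append("service")
    return switching

print(layers_that_switch(50, 10))                     # ['transport', 'service']: conflict
print(layers_that_switch(50, 10, hold_off_ms=100))    # ['transport']
print(layers_that_switch(None, 10, hold_off_ms=100))  # ['service'], 110 ms after the fault
print(layers_that_switch(50, 200))                    # ['transport']: do nothing suffices
```

The third call shows the price of hold-off: faults that the transport layer cannot recover are repaired later by the service layer, which is why hold-off values have to be engineered with care.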


Figure 12: Optical Channel Protection Selectivity (example: STM ring protection over an optical network, with optical-network protection disabled on the OCh carrying the STM signals. Protection selectivity is the ability to disable optical-layer protection so that client-layer protection operates without conflicts.)


Using hold-off timers implies the need for engineering rules that take into account the various network topologies and multiple administrative domains. This requirement complicates network maintenance, because hold-off timers have to be provisioned and may need to be re-provisioned as the network evolves. U.S.-based operators tend not to use hold-off timers because embedded SONET systems do not offer provisionable hold-off capability, whereas some SDH operators opt for fixed hold-off timers, depending on the particular survivability mechanism.

MULTI-LAYER SURVIVABILITY NETWORK ARCHITECTURE RECOMMENDATION

Supporting Extra Traffic

As a result of fiber exhaust, the ability to support extra traffic is becoming increasingly important. This trend will continue into future networks for the following reasons:

• The capacity on a single link will continue to grow into the multi-Gigabit range. Supporting extra traffic is a cost-effective means of decreasing the cost per bit as the total cost of transporting traffic on a single fiber increases, particularly for best-effort traffic. Therefore, it will be attractive to support extra traffic on these interfaces because, given the amount of data, management is simple and cost-effective.

• ATM (and, in the future, IP) supports QoS differentiation. Extra traffic can be used to support the ATM traffic types Unspecified Bit Rate (UBR) and Available Bit Rate (ABR), or the IP traffic type "Best Effort" (Figure 13).

• Deploying extra traffic in a multi-layer survivability environment that includes a packet-based protocol can be very attractive. The combination of service-layer restoration and statistical multiplexing makes it possible for ATM/IP-based services carried as extra traffic in the transport layer to be recovered. The flexibility of a service-layer restoration mechanism ensures that an alternative path will be found if one exists. Finding an alternative path may take minutes; however, such a lengthy interval is not a problem, because it only affects lower-priority traffic. Further, QoS differentiation, combined with statistical multiplexing, ensures that extra traffic is restored only when capacity for it actually becomes available.
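The extra-traffic idea of Figure 13 can be sketched as a 1:1 protected span whose idle protection channel carries preemptable services. The class and method names are hypothetical, and revertive operation after repair is an assumption; the sketch only illustrates the preemption order: working traffic always wins the protection channel, and preempted extra traffic is handed back to the service layer to recover.

```python
from dataclasses import dataclass, field

@dataclass
class OneToOneSpan:
    """Hypothetical 1:1 protected span; idle protection may carry extra traffic."""
    working: list = field(default_factory=list)  # protected services
    extra: list = field(default_factory=list)    # preemptable, e.g. UBR/ABR or best-effort IP
    protection_in_use: bool = False              # True while carrying working traffic

    def add_extra(self, service):
        """Place a preemptable service on the protection channel if it is idle."""
        if self.protection_in_use:
            return False
        self.extra.append(service)
        return True

    def working_failed(self):
        """Switch working traffic onto protection, preempting all extra traffic.

        Returns the preempted services so the service layer (ATM/IP) can try
        to reroute them elsewhere.
        """
        preempted, self.extra = self.extra, []
        self.protection_in_use = True
        return preempted

    def working_repaired(self):
        """Assumed revertive operation: free the protection channel again."""
        self.protection_in_use = False

span = OneToOneSpan(working=["SDH premium service"])
span.add_extra("best-effort IP")
print(span.working_failed())  # ['best-effort IP']
print(span.add_extra("UBR"))  # False: protection channel is busy
span.working_repaired()
print(span.add_extra("UBR"))  # True
```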

Figure 13: Use of Extra Traffic (example: a VPN via OMS/SPRing with a 1:1 protection function, working OCh and idle protection OCh, and best-effort IP placed as extra traffic on the VPN's protection OCh. Extra traffic is the ability to place lower-priority (preemptable) services on idle protection bandwidth.)

Avoiding Single Points of Failure

Multi-layer survivability can help avoid single points of failure in a multi-layer network. For example, consider the case of dual-node interworking, which was introduced to avoid a single point of failure when cascaded SONET/SDH survivability mechanisms are used. When such SONET/SDH networks evolve to encompass new technologies, corresponding means of providing robust survivability will need to be introduced. This will result in many possible combinations of cascaded mechanisms.

To successfully support dual-node interworking in cascaded architectures, protection and restoration mechanisms in the multiple layers may need to implement this concept. In the worst case, these mechanisms would have to cooperate by means of a control protocol. Not only would this be an enormous standardization task, it might affect the entire SONET/SDH embedded base. Another disadvantage of dual-node interworking is that extra traffic is not supported in all cases. Thus, implementing dual-node interworking in a multi-layered network can be a major effort; hence, a "not all that can be done has to be done" approach needs to be considered.

Multi-layer survivability offers an alternative way of solving the single-point-of-failure problem. Especially for ATM/IP applications, the transport-layer survivability mechanism can leave the recovery of the single point of failure to the ATM/IP recovery mechanisms. However, a total overbuild of the service nodes has to be avoided; QoS differentiation, combined with statistical multiplexing, may already be sufficient. Otherwise, end-to-end ATM protection may be used for applications where the required QoS cannot otherwise be achieved.

CONCLUSIONS

This paper has described the major aspects of multi-layer survivability, has identified the major benefits that may be derived, and has demonstrated that multi-layer survivability has value when properly deployed. Especially for ATM and IP applications, multi-layer survivability can significantly improve a network operator's ability to ensure QoS, while enabling network efficiency and cost savings.

Multi-layer survivability combines the merits of transport-layer recovery (for example, OTN/SDH/SONET) with service-layer recovery (for example, ATM/IP). Transport-layer recovery offers fast restoration and easy maintenance, because it involves fewer, larger-granularity transport entities, and it supports carrying extra traffic. Service-layer recovery, in combination with QoS differentiation and statistical multiplexing, as appropriate, can be used to offer different types of enhanced services.

This paper makes recommendations for the recovery mechanisms serving transport-layer and service-layer roles. The recovery mechanisms serving in the lower layers (for example, OTN or SDH/SONET) should allow for protection selectivity; that is, the flexibility to turn the recovery scheme on or off at the finest level of granularity. The recovery mechanisms serving in the higher layers (for example, IP or ATM) should be able to delay recovery for a pre-defined time, allowing the lower-layer survivability mechanism to recover from the fault; this is the hold-off approach.


REFERENCES

[1] P. Demeester, M. Gryseels, K. Van Doorselaere, et al. (University of Gent), "Resilience in a multi-layer network," ACT project AC205/PANEL. Presented at the First International Workshop on the Design of Reliable Communication Networks, Brugge, Belgium, May 1998.

[2] M. Gryseels (University of Gent), S. Ohta (NTT), R. Clemente (CSELT), "A cost evaluation of service protection strategies in ATM on SDH transport networks," ACT project AC205/PANEL. Presented at the First International Workshop on the Design of Reliable Communication Networks, Brugge, Belgium, May 1998.

[3] C. Newton (EMEA Product Market Development), "The Value of Transport Networking in a Data-Centric World," Lucent Technologies white paper, Issue 1.0, April 1999.


GLOSSARY

Abbreviations Used:

ABR Available Bit Rate

ATM Asynchronous Transfer Mode

BLSR Bi-directional Line Switched Ring

DWDM Dense Wavelength Division Multiplexing

IP Internet Protocol

ITU International Telecommunication Union

MS-SPRing Multiplex Section Shared Protection Ring

MSP SDH Multiplex Section Protection

OCh Optical Channel

OMSP Optical Multiplex Section Protection

OSPF Open Shortest Path First

OTN Optical Transport Network

PDH Plesiochronous Digital Hierarchy

PNNI Private Network Node Interface

QoS Quality of Service

RIP Routing Information Protocol

SDH Synchronous Digital Hierarchy

SLA Service Level Agreement

SNCP SubNetwork Connection Protection

SONET Synchronous Optical Network

SPRing Shared Protection Ring

UBR Unspecified Bit Rate

UPSR Uni-directional Path Switched Ring

VC Virtual Channel (ATM); Virtual Container (SDH)

VP Virtual Path

VPG Virtual Path Group


This document is for planning purposes only, and is not intended to modify or supplement any Lucent Technologies specifications or warranties relating to these products or services. Performance figures and data quoted in this document are typical and must be specifically confirmed in writing by Lucent Technologies before they become applicable to any particular order or contract. The company reserves the right to make alterations or amendments to the detailed specifications at its discretion.

The publication of information in this document does not imply freedom from patent or other protective rights of Lucent Technologies or others.

WaveStar and AllMetro are trademarks of Lucent Technologies Inc.

To learn more about our comprehensive portfolio and the new WaveStar™ ONG Series, please contact your Lucent Technologies Sales Representative or call 1-888-4 LUCENT. Visit our web site at http://www.lucent-optical.com

Copyright © 1999 Lucent Technologies Inc. All rights reserved. Printed in Holland / USA

Lucent Technologies Inc., Marketing Communications, Order Number: WP-008/990702

If you enjoyed this publication and are interested in our other White Papers, please visit the Optical Networking web site at http://www.lucent-optical.com/resources/ for a complete listing of currently available titles.
