www.huawei.com Copyright © 2006 Huawei Technologies Co., Ltd. All rights reserved. OptiX SDH System Troubleshooting Methods

OTA000301 OptiX SDH System Troubleshooting Methods ISSUE 1.20


Objectives

Upon completion of this course, you will be able to:

List the common analysis methods for fault locating.

Outline the fault handling flow.

Analyze typical faults such as traffic interruption and bit errors.


Contents

1. Troubleshooting Preparation

2. Troubleshooting Idea and Methods

3. Classified Troubleshooting Examples

1. Troubleshooting Preparation

Requirements for Maintenance Staff-I

Professional Skills

Be familiar with the hardware system and SDH fundamentals.

Be familiar with the alarm generation mechanism and the signal flow in the transmission system.

Be familiar with the basic maintenance instruments and tools.

Be familiar with the network under maintenance: network topology, network protection, and traffic configuration.

Requirements for Maintenance Staff-II

Professional Skills

Be familiar with common alarms:

SDH line alarms (R_LOS, R_LOF, R_OOF, AU_AIS, AU_LOP, MS_AIS, MS_RDI, B1_EXC, B2_EXC, HP_LOM, HP_SLM, HP_TIM, HP_UNEQ);

PDH tributary alarms (TU_AIS, TU_LOP, T_ALOS, T_DLOS, P_LOS, EXT_LOS, UP_E1_AIS, LP_RDI, LP_SLM, LP_TIM, LP_UNEQ, B3_EXC);

Protection switching alarms (PS);

Clock alarms (LTI, SYNC_C_LOS, SYN_BAD);

Equipment alarms (POWER_FAIL, FAN_FAIL, BD_STATUS).

Collect and save on-site data:

System alarms, performance event data, configurations, and NMS operation records.

Fault Handling Flow

Flow Chart

Start

Record the fault trace.

External cause? Yes: other handling flows. No: analyze the fault to locate it.

Fault removed? Yes: Continue 2. No: report the fault to Huawei, then Continue 1.

Fault Handling Flow - cont.

Continue 1: make a solution together, then try the solution.

Continue 2: observe the service running.

[Flow chart: decision points "Service recovered?" and "Fault removed?", each with Yes and No branches]

Archive the fault handling report.

End

2. Troubleshooting Idea and Methods

Question

What is the key to troubleshooting?

To locate a failure ACCURATELY to one station.

How to Locate a Fault?

Basic Principles of Fault Localization

External first, then transmission: first rule out external causes such as broken fiber, switch failure, power failure, and grounding problems.

Network first, then network elements: try your best to locate the trouble to one node.

LU first, then TU: line unit (LU) alarms can lead to tributary unit (TU) alarms.

Higher-severity alarms first, then lower-severity alarms: first analyze critical/major alarms, then move on to minor/warning alarms.
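The ordering principles above (higher severity first, LU before TU) can be sketched as a sort key. This is only an illustration: the `Alarm` record, the severity assigned to each alarm, and the choice to rank severity ahead of board type are assumptions, not something the course material specifies.

```python
from dataclasses import dataclass

# Assumed ranking: a lower value means "analyze first".
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}
BOARD_RANK = {"LU": 0, "TU": 1}  # line unit alarms before tributary unit alarms


@dataclass(frozen=True)
class Alarm:
    node: int
    board: str     # "LU" or "TU"
    name: str
    severity: str  # key of SEVERITY_RANK


def analysis_order(alarms):
    """Sort alarms by severity first, then LU before TU."""
    return sorted(
        alarms,
        key=lambda a: (SEVERITY_RANK[a.severity], BOARD_RANK[a.board]),
    )


# Hypothetical alarm list; the severities are illustrative only.
alarms = [
    Alarm(4, "TU", "TU_AIS", "major"),
    Alarm(3, "LU", "R_LOS", "critical"),
    Alarm(1, "TU", "LP_RDI", "minor"),
    Alarm(2, "LU", "MS_RDI", "minor"),
]
ordered = analysis_order(alarms)
print([a.name for a in ordered])
```

On a real network the severity of each alarm comes from the NMS, so only the sort key would carry over.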

Common Methods of Fault Localization

Keys of Fault Localization

1. Alarm and performance analysis

2. Loopback

3. Replacement

4. Configuration data analysis

5. Configuration modification

6. Test with instruments

7. Experience

Alarm and Performance Analysis-I

Evaluate Whole Network

How to obtain alarms and performance events?

Observe indicators on boards and cabinets: not detailed; no history alarms.

Use the NMS: comprehensive (all alarms and performance events from the whole network) and accurate (current alarms, history alarms, occurrence time, and performance event data can be queried).

Alarm and Performance Analysis-II

Main Steps

1. Obtain alarms and performance events.

2. Select the key alarms or performance events.

3. Analyze the reasons.

4. Limit the trouble to a certain range or a node.

Alarm and Performance Analysis-III

Case

[Figure: chain of nodes 1, 2, 3, 4 connected through west (w) and east (e) line ports; alarms shown: TU-AIS, LP-RDI, R-LOS, MS-RDI]

Loopback

What is Loopback?

Loopback is the most common and most efficient method in troubleshooting.

[Figure: SDH equipment with two line ports and a tributary port; inloop and outloop are marked at each port]

Loopback

Where Do We Loop?

Board involved: Tributary board
Loopback options: Inloop/outloop
Loopback tools: Loopback cable, NMS
Loopback level: Loopback at path level
Application: Separate switching faults from transmission faults. Roughly determine whether the tributary board has failed. No need to modify the service configuration.

Board involved: Line board
Loopback options: Inloop/outloop
Loopback tools: Patch fiber, NMS
Loopback level: Loopback by optical interface
Application: Locate single-station faults. Roughly determine whether the line board has failed. No need to modify the service configuration.

Notes: Software loopback is NOT an absolute method. Why? It may interrupt the traffic and ECC, and it is automatically removed in 5 minutes (provisional).

Loopback

Procedure

1. Select one NE from several faulty NEs.

2. Choose one affected traffic path from the selected faulty NE.

3. Draw the traffic flow diagram (source, sink, pass-through).

4. Connect testing devices.

5. Check alarms.

[Figure: traffic flow diagram across nodes 3, 2, 1 with line timeslots w2:17 and e2:17 and tributary timeslot t2:1 at both ends]

Replacement

When to Use?

Objective: fiber, cable, board, modules

Application: external faults, board faults

MSP switch, SNCP switch, active/standby XC switch, TPS switch

Configuration Data Analysis

Query & Analyze the Configuration

Timeslot configuration

J1 or C2 bytes

LU and TU path loopback

SNCP or MSP switching conditions

External commands (e.g. locked switch)

Consistent configuration in both NMS and NEs

Configuration Modification

Fast Solution

Objective: port, timeslot, sub-rack slots

Application examples: no spare boards; restore the traffic temporarily

Testing Instrument

Accurate Judgments

Instrument: Test item

Bit error testing device: Bit error/traffic

Optical power meter: Optical power

SDH analyzer: Bit error/traffic/overhead bytes

Multi-meter: Voltage/current/resistance

……

This method is the most reliable one, but the instruments must be on hand.

Experience

Rule of Thumb

Reset board

Power off and on

Resend the configuration

Do not consider these a cure-all. They do not help us find the real cause of the failure.

Summary

Method: Alarm and performance analysis
Application: Universal
Features: 1. Evaluate the whole network situation. 2. Locate the faulty point preliminarily based on the collected data. 3. Cause no negative effect on normal services. 4. Depend on the NMS.

Method: Loopback
Application: Locate the fault to a single station or board
Features: 1. Independent of alarm and performance event analysis. 2. Rapid and effective.

Method: Replacement
Application: Locate the fault to a board or isolate external faults
Features: 1. Convenient. 2. Require spare parts/equipment. 3. Applied with other methods.

Method: Configuration data analysis
Application: Locate the fault to a single station or board
Features: 1. Can find the fault cause. 2. Fault locating time is longer. 3. Depend on the NMS.

Method: Configuration modification
Application: Locate the fault to a board
Features: 1. Have a high risk. 2. Depend on the NMS.

Method: Test with instruments
Application: Isolate external faults and resolve interconnectivity problems
Features: 1. A general method with high accuracy. 2. Have certain requirements for the meters. 3. Applied with other methods.

Method: Experience
Application: Special cases
Features: 1. Fast fault handling. 2. High probability of mistake. 3. Need experience accumulation.

3. Classified Troubleshooting Examples

Troubleshooting Sequence

1. Exclude external troubles: switching problem? Fiber problems? Trunk cable? Power supply system? Grounding problem?
Methods: replacement, instrument testing, loopback, alarm/performance analysis.

2. Locate troubles to one NE.
Methods: loopback, alarm/performance analysis.

3. Locate the troubles to one board.
Methods: replacement, loopback, alarm/performance analysis, configuration analysis, configuration modification, rule of thumb.

Classified Troubleshooting Examples

Traffic Interruption

Bit Errors

Traffic Interruption

Possible Causes

External causes: power supply system (equipment power off, under voltage, etc.); switch problems; fiber or trunk cables (excessive attenuation, fiber cut, cable disconnection).

Operation causes: loopback; data modification.

Equipment failure: faulty board; performance degradation.

Traffic Interruption

Operations

Equipment operator: check the indicator status on each board; analyze the alarms; hardware loopback; replacement.

NMS operator: check the login of each station; query and analyze alarms; loopback section by section; configuration modification; implement switch.

Traffic Interruption

No-protection Line-Case

Network Configuration: Node 1 is the centralized services node. Each station has E1 services with node 1.

Failure Description: The E1 service between nodes 1 and 4 is interrupted. Node 4: TU-AIS. Node 1: LP-RDI. Other services are normal.

[Figure: chain of nodes 1, 2, 3, 4 with west (w) and east (e) line ports; tributary timeslot t2:1 at the two end nodes and line timeslot 2:1 on each section; TU-AIS and LP-RDI marked]

Traffic Interruption

Where is the Problem?

Query alarms: TU-AIS in node 4 only.

Alarm analysis: Node 4 cannot receive the traffic from node 1. Other traffic between nodes 1, 2, and 3 is normal.

Failure location: between nodes 3 and 4.
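The reasoning in this case (services to the nearer nodes are normal, the service to node 4 is not, so the fault lies between nodes 3 and 4) can be sketched as a walk along the chain. The function name and the `service_ok` input are hypothetical; in practice that information comes from the alarms queried on the NMS.

```python
def suspect_section(chain, service_ok):
    """Walk outward from the hub and return the first failing section.

    chain: node IDs in order, starting at the hub node.
    service_ok: maps each non-hub node to True when its service with
        the hub is normal.
    Returns (last_good_node, first_bad_node), or None when every
    service is normal.
    """
    last_good = chain[0]
    for node in chain[1:]:
        if not service_ok[node]:
            return (last_good, node)
        last_good = node
    return None


# The no-protection line case: only the service to node 4 is interrupted.
chain = [1, 2, 3, 4]
service_ok = {2: True, 3: True, 4: False}
print(suspect_section(chain, service_ok))  # (3, 4)
```

The result names a section including both end nodes, so loopback or replacement is still needed to narrow it to one node.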

Traffic Interruption

Analysis

Loopback

1. Connect a BER tester.

2. Outloop on VC4 #2 at node 4. Normal? Yes: failure in node 4. No: failure between nodes 3 and 4; go to step 3.

3. Soft inloop on VC4 #2 at the east LU of node 3. Normal? No: failure in node 3. Yes: failure between nodes 3 and 4; go to step 4.

4. Hard optical port inloop at the east LU of node 3. Normal? No: failure in node 3. Yes: failure in node 4.
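The loopback analysis on this slide is a short decision sequence, sketched below under stated assumptions. Each argument is a hypothetical callable standing for "set this loopback, read the BER tester, return True if traffic is normal"; none of them is a real NMS call. Like the slide, the sketch does not distinguish the fiber between nodes 3 and 4 from node 4 itself.

```python
def locate_by_loopback(outloop_node4, soft_inloop_node3, hard_inloop_node3):
    """Follow the slide's loopback sequence and return the suspect node.

    Each argument is a zero-argument callable that returns True when
    the BER tester shows normal traffic with that loopback in place.
    Later loopbacks are only performed when earlier ones require it.
    """
    if outloop_node4():
        return "node 4"      # path up to node 4's line port is clean
    if not soft_inloop_node3():
        return "node 3"      # fault is before node 3's east optical port
    if not hard_inloop_node3():
        return "node 3"      # east LU optical port of node 3 is suspect
    return "node 4"          # node 3 verified up to its optical port


# Hypothetical outcome: outloop fails, both loops at node 3 are clean.
print(locate_by_loopback(lambda: False, lambda: True, lambda: True))
```

Passing callables rather than booleans keeps the order of operations explicit, which matters because each loopback can interrupt traffic.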

Traffic Interruption

Final Solution

Replacement

The failure has been located in one node; the LU, TU, or XC may be faulty.

1. TPS switch. Traffic normal? Yes: replace the faulty TU. No: go to step 2.

2. Active/standby XC switch. Traffic normal? Yes: replace the faulty XC. No: replace the faulty LU.

Traffic Interruption

SNCP Ring-Case

Network Configuration: Node 1 is the centralized services node. Each station has E1 services with node 1.

Failure Description: All E1 services are interrupted. Nodes 2, 3, 4: TU-AIS. Node 1: LP-RDI.

[Figure: SNCP ring of nodes 1, 2, 3, 4 with west (w) and east (e) line ports; LP-RDI at node 1 and TU-AIS at nodes 2, 3, 4]

Traffic Interruption

Where is the Problem?

Thoughts and methods:

Alarm/performance analysis

Disconnect the ring and convert it to a line

Replacement

Loopback

Analyze configuration correctness

Traffic Interruption

MSP Ring-Case

Network Configuration: Node 1 is the centralized services node. Each station has E1 services with node 1. Services are configured on the shortest route.

Failure Description: The fibers between NE2 and NE3 are broken (R-LOS). E1 services between nodes 1 and 3 are interrupted. Nodes 1, 3: TU-AIS. Other services are normal.

[Figure: STM-4 MSP ring of nodes 1, 2, 3, 4, 5 with west (w) and east (e) line ports; R-LOS on the section between nodes 2 and 3; TU-AIS at nodes 1 and 3]

Traffic Interruption

Where is the Problem?

MSP switch process

LU: SF or SD detection; K1 & K2 bytes transmission.

SCC: APS protocol processed normally; APS controller started; correct switch state.

XC: implement switching.

Protection channels: available.
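The MSP switch process above works as a checklist: switching only succeeds when every listed condition holds. A minimal sketch of that checklist follows; the dictionary layout and the `observed` input are assumptions for illustration, and the condition strings simply restate the slide.

```python
# Conditions taken from the slide, grouped by the component responsible.
MSP_PRECONDITIONS = {
    "LU": ["SF or SD detected", "K1/K2 bytes transmitted"],
    "SCC": [
        "APS protocol processed normally",
        "APS controller started",
        "switch state correct",
    ],
    "XC": ["switching implemented"],
    "Protection channels": ["available"],
}


def failed_preconditions(observed):
    """Return the (component, condition) pairs that are not confirmed.

    observed: maps (component, condition) to True or False; a missing
    entry counts as not verified.
    """
    return [
        (component, condition)
        for component, conditions in MSP_PRECONDITIONS.items()
        for condition in conditions
        if not observed.get((component, condition), False)
    ]


# Hypothetical check result: everything is fine except the APS controller.
observed = {
    (component, condition): True
    for component, conditions in MSP_PRECONDITIONS.items()
    for condition in conditions
}
observed[("SCC", "APS controller started")] = False
print(failed_preconditions(observed))
```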

Traffic Interruption

Analysis

1. Query and check alarms.

2. Check the switch status. Normal? Yes: go to step 6. No: the APS protocol may have stopped; restart it.

3. Switch status normal? Yes: go to step 6. No: resend the configuration.

4. Switch status normal? Yes: go to step 6. No: restart the APS protocol node by node to locate the faulty LU/XC.

5. Replace the faulty LU/XC.

6. Draw the switched traffic flow diagram, then loopback section by section to locate the faulty LU/XC and replace it.

[Figure: STM-4 MSP ring of nodes 1, 2, 3, 4, 5 with R-LOS on both sides of the broken section, APS-INDI at all five nodes, and switching (S) and pass-through (P) node states]

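Traffic Interruption

[Figure: STM-4 MSP ring of nodes 1, 2, 3, 4, 5 after switching, with R-LOS on the broken section, APS-INDI at every node, and switching (S) and pass-through (P) node states]

Normal route: nodes 3, 2, 1, with tributary timeslot t2:1 at both ends and line timeslots e1:17 and w1:17.

Switched route: nodes 1, 2, 1, 5, 4, 3, with tributary timeslot t2:1 at both ends, line timeslots e1:17 and w1:17 toward node 2, and line timeslots e3:17 and w3:17 on the switched sections.

Notes: The switched route is one complex line, so the dichotomy (bisection) method can be used.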
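The dichotomy note can be illustrated with a bisection over the hops of the switched route. This is a sketch under assumptions: there is a single fault, the tester sits at the first hop, and `loop_ok(i)` is a hypothetical callable that returns True when a loopback at hop `i` returns clean traffic to the tester. Hop indices are used instead of node IDs because node 1 appears twice on the switched route.

```python
def bisect_fault(route, loop_ok):
    """Narrow a single fault to one section of a route by bisection.

    route: hops in order, starting at the tester side.
    loop_ok(i): True when a loopback at hop i gives normal traffic.
    Assumes the end-to-end path is faulty, so a loopback at the last
    hop would fail. Returns (i, i + 1): the fault lies between those hops.
    """
    lo, hi = 0, len(route) - 1  # hop lo is known good, hop hi known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if loop_ok(mid):
            lo = mid   # clean up to mid: the fault is further along
        else:
            hi = mid   # already bad at mid: the fault is before or at mid
    return lo, hi


# Switched route from the figure; the fault position here is hypothetical.
route = [1, 2, 1, 5, 4, 3]
tested = []


def loop_ok(i):
    tested.append(i)
    return i <= 3      # pretend everything up to hop 3 (node 5) is clean


section = bisect_fault(route, loop_ok)
print(section, [route[i] for i in section], "loopbacks used:", len(tested))
```

With six hops, three loopbacks are enough here, compared with up to five when looping section by section.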

Bit Errors

Possible Causes

External causes: performance degradation of fibers or excessive attenuation; dirty fiber joint or incorrect connector; poor equipment grounding; strong interference source near the equipment; poor ventilation or high operating temperature.

Equipment failure: transmitter or receiver failure in the LU; poor synchronization; poor coordination between XC and LU/TU; fan failure; faulty boards or poor performance.

Bit Errors

Operations

Equipment operator: measure optical power; check cable or fiber connections and grounding; clean fiber connectors; check ventilation and temperature; hardware loopback; replace board; exclude interference sources.

NMS operator: query and analyze alarms/performance events; loopback section by section; configuration modification; implement switch.

Bit Errors

No-protection Line-Case

Network Configuration: Node 1 is the centralized services node. Each station has E1 services with node 1.

Failure Description: Too many bit errors.

[Figure: chain of nodes 1, 2, 3, 4 with west (w) and east (e) line ports; performance events shown: LPBBE, LPFEBBE, RSBBE, MSBBE, HPBBE, MSFEBBE, HPFEBBE]

Bit Errors

Where is the Problem?

Performance event analysis

1. Check and exclude external causes.

2. Analyze the performance events: LPBBE from node 4 to node 1; RSBBE/MSBBE/HPBBE from node 4 to node 3.

3. LU first, then TU: the failure is located between nodes 3 and 4.
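The "LU first, then TU" rule in this analysis can be sketched as a filter over performance counters: receive directions with line-level block errors (RSBBE, MSBBE, HPBBE) are suspected first, and low-order errors such as LPBBE are treated as a consequence. The data layout and all counts below are hypothetical, and attaching LPBBE to a line direction is a simplification made to keep the example short.

```python
# Line-level (LU) counters are checked before path-level (TU) ones.
LINE_LEVEL = ("RSBBE", "MSBBE", "HPBBE")


def suspect_sections(counters):
    """Return the receive directions that show line-level block errors.

    counters: maps (receiving_node, sending_node) to a dict of
        {counter_name: count} read for that direction.
    """
    return [
        link
        for link, values in counters.items()
        if any(values.get(name, 0) > 0 for name in LINE_LEVEL)
    ]


# Hypothetical counts for the chain 1-2-3-4 in this case.
counters = {
    (1, 2): {"LPBBE": 40},  # low-order errors seen at the sink, node 1
    (2, 3): {},
    (3, 4): {"RSBBE": 120, "MSBBE": 118, "HPBBE": 95},
}
print(suspect_sections(counters))  # [(3, 4)]
```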

Bit Errors

Analysis

Performance event analysis (continued)

1. Check fans and temperature. Normal? No: solve the problems. Yes: go to step 2.

2. Measure or query optical power. Normal? No: check and replace the transmitter, fiber, connector, or receiver. Yes: continue.
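For the optical power step, "normal" usually means the received power lies between the receiver sensitivity and the overload point, with some margin. The sketch below assumes that reading; the function name, the 3 dB default margin, and the threshold values in the example are hypothetical and should be replaced by the figures from the optical interface specification of the actual board.

```python
def optical_power_status(rx_dbm, sensitivity_dbm, overload_dbm, margin_db=3.0):
    """Classify a measured receive power against receiver limits.

    rx_dbm: measured or queried receive power in dBm.
    sensitivity_dbm: lowest power the receiver tolerates.
    overload_dbm: highest power the receiver tolerates.
    margin_db: assumed safety margin above the sensitivity.
    """
    if rx_dbm > overload_dbm:
        return "too high"
    if rx_dbm < sensitivity_dbm:
        return "too low"
    if rx_dbm < sensitivity_dbm + margin_db:
        return "marginal"
    return "normal"


# Example thresholds only; take real limits from the board's datasheet.
print(optical_power_status(-15.0, sensitivity_dbm=-28.0, overload_dbm=-8.0))
print(optical_power_status(-26.5, sensitivity_dbm=-28.0, overload_dbm=-8.0))
```

A "too low" or "marginal" result points to the transmitter, fiber, or connectors; a "too high" result typically calls for an attenuator.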

Bit Errors

Final Solution

Loopback & Replacement

1. Connect a BER tester.

2. Loopback, active/standby XC switch, modify configuration.

3. Locate and replace the faulty LU/XC.

Bit Errors

Think About It!

Question: How to solve occasional bit errors?

You cannot keep a loopback for a long time.

Interchange: fiber or LU.

Questions

What is the principle of troubleshooting?

External first, then internal

Station first, then boards

LU first, then TU

Higher-severity alarms first, then lower-severity alarms

What is the key to troubleshooting?

To locate a failure ACCURATELY to a certain station.

Summary

Which methods for troubleshooting?

1. Alarm and performance analysis

2. Loopback

3. Replacement

4. Configuration data analysis

5. Configuration modification

6. Test with instruments

7. Rule of thumb

Thank you