Alm Rnc Oms Alarms

Nokia Siemens Networks WCDMA RAN, rel. RU20, operating documentation, issue 1

RNC OMS alarms

DN70398724

Issue 03B
Approval Date 2009-11-30


Id:0900d80580654bd5

The information in this document is subject to change without notice and describes only the product defined in the introduction of this documentation. This documentation is intended for the use of Nokia Siemens Networks customers only for the purposes of the agreement under which the document is submitted, and no part of it may be used, reproduced, modified or transmitted in any form or means without the prior written permission of Nokia Siemens Networks. The documentation has been prepared to be used by professional and properly trained personnel, and the customer assumes full responsibility when using it. Nokia Siemens Networks welcomes customer comments as part of the process of continuous development and improvement of the documentation.

The information or statements given in this documentation concerning the suitability, capacity, or performance of the mentioned hardware or software products are given "as is" and all liability arising in connection with such hardware or software products shall be defined conclusively and finally in a separate agreement between Nokia Siemens Networks and the customer. However, Nokia Siemens Networks has made all reasonable efforts to ensure that the instructions contained in the document are adequate and free of material errors and omissions. Nokia Siemens Networks will, if deemed necessary by Nokia Siemens Networks, explain issues which may not be covered by the document.

Nokia Siemens Networks will correct errors in this documentation as soon as possible. IN NO EVENT WILL Nokia Siemens Networks BE LIABLE FOR ERRORS IN THIS DOCUMENTATION OR FOR ANY DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, DIRECT, INDIRECT, INCIDENTAL OR CONSEQUENTIAL OR ANY LOSSES, SUCH AS BUT NOT LIMITED TO LOSS OF PROFIT, REVENUE, BUSINESS INTERRUPTION, BUSINESS OPPORTUNITY OR DATA, THAT MAY ARISE FROM THE USE OF THIS DOCUMENT OR THE INFORMATION IN IT.

This documentation and the product it describes are considered protected by copyrights and other intellectual property rights according to the applicable laws.

The wave logo is a trademark of Nokia Siemens Networks Oy. Nokia is a registered trademark of Nokia Corporation. Siemens is a registered trademark of Siemens AG.

Other product names mentioned in this document may be trademarks of their respective owners, and they are mentioned for identification purposes only.

Copyright © Nokia Siemens Networks 2009. All rights reserved

Important Notice on Product Safety

Elevated voltages are inevitably present at specific points in this electrical equipment. Some of the parts may also have elevated operating temperatures.

Non-observance of these conditions and the safety instructions can result in personal injury or in property damage.

Therefore, only trained and qualified personnel may install and maintain the system.

The system complies with the standard EN 60950 / IEC 60950. All equipment connected has to comply with the applicable safety standards.

The same text in German:

Wichtiger Hinweis zur Produktsicherheit

In elektrischen Anlagen stehen zwangsläufig bestimmte Teile der Geräte unter Spannung. Einige Teile können auch eine hohe Betriebstemperatur aufweisen.

Eine Nichtbeachtung dieser Situation und der Warnungshinweise kann zu Körperverletzungen und Sachschäden führen.

Deshalb wird vorausgesetzt, dass nur geschultes und qualifiziertes Personal die Anlagen installiert und wartet.

Das System entspricht den Anforderungen der EN 60950 / IEC 60950. Angeschlossene Geräte müssen die zutreffenden Sicherheitsbestimmungen erfüllen.


Table of Contents

This document has 103 pages.

Summary of changes

1 RNC OMS alarms
1.1 70001 CONFIGURATION OF SNMP MEDIATOR IS OUT OF ORDER
1.2 70002 INVALID SNMP TRAP COMMUNITY STRING
1.3 70003 NO REPLY TO SNMP REQUEST
1.4 70004 UNKNOWN SNMP TRAP
1.5 70005 INCORRECT ALARM DATA
1.6 70007 AUTHENTICATION FAILURE IN ETHERNET DEVICE
1.7 70011 NODE NOT RESPONDING
1.8 70025 POSSIBLE SECURITY THREAT IN NETWORK ELEMENT
1.9 70030 DISK DATABASE IS GETTING FULL
1.10 70064 BACKUP ERROR
1.11 70110 CONFIGURATION OF NWI3 ADAPTER IS OUT OF ORDER
1.12 70111 FAILED TO CREATE NETACT CONNECTION
1.13 70156 DISK DATABASE WATCHDOG START-UP FAILED
1.14 70157 CPU USAGE OVER LIMIT
1.15 70158 FILE SYSTEM USAGE OVER LIMIT
1.16 70159 MANAGED OBJECT FAILED
1.17 70160 MEMORY USAGE OVER LIMIT
1.18 70161 OPERATING SYSTEM MONITORING FAILURE
1.19 70162 RAID ARRAY HAS BEEN DEGRADED
1.20 70163 ETHERNET INTERFACE USAGE OVER LIMIT
1.21 70164 ETHERNET LINK FAILURE
1.22 70166 MANAGED OBJECT LOCKED
1.23 70168 CLUSTER STARTED (RESTARTED)
1.24 70173 BACKEND DATABASE REQUIRED BY CORBA NAMING SERVICE IS UNAVAILABLE
1.25 70186 CLUSTER OPERATION INITIATED BY OPERATOR
1.26 70188 MANAGED OBJECT SHUTDOWN BY OPERATOR
1.27 70189 MANAGED OBJECT UNLOCKED BY OPERATOR
1.28 70236 LDAP DATABASE CORRUPTED
1.29 70237 CORRUPTED LDAP DATABASE RECOVERED
1.30 70243 ALARM PROCESSOR CONFIGURATION IS OUT OF ORDER
1.31 70244 CORRUPTED ALARM DATA
1.32 70245 ILLEGAL INTERNAL USAGE OF EXTERNAL ALARM NOTIFICATION FORMAT
1.33 70246 ALARM SYSTEM HEARTBEAT
1.34 70247 ALARM SYSTEM HEARTBEATING SWITCHED OFF
1.35 70256 RESOURCE ALLOCATION OR DE-ALLOCATION FAILURE
1.36 70265 RECOVERY ACTIONS BANNED FOR MANAGED OBJECT
1.37 70267 EXTERNAL USER ACCOUNT VALIDATION FAILED
1.38 70268 EXTERNAL LDAP FAILURE
1.39 70269 INVALID ACTIVE SESSIONS
1.40 70280 UNKNOWN SPECIFIC PROBLEM
1.41 71000 PM FTP CONNECTION FAILED
1.42 71001 MEASUREMENT DATA NOT TRANSFERRED
1.43 71002 MEASUREMENT DATA ERROR
1.44 71003 OMS MEASUREMENT DATA PROCESSING OVERLOAD
1.45 71005 THRESHOLD MONITORING LIMIT EXCEEDED
1.46 71006 WCEL THRESHOLD MONITORING LIMIT EXCEEDED
1.47 71007 MEASUREMENT THRESHOLD MONITORING LIMIT EXCEEDED
1.48 71050 OMS EMT CONNECTION COULD NOT BE OPENED
1.49 71051 OMS EMT CONTROL CONNECTION FAILURE
1.50 71052 OMS FILE TRANSFER CONNECTION COULD NOT BE OPENED
1.51 71053 O&M SUPPORT FOR INTEGRATED 3RD PARTY DEVICES
1.52 71054 WCDMA BTS O&M MEDIATION FAILURE
1.53 71055 NETWORK ELEMENT RESTARTED
1.54 71057 RNW NOTIFICATION MISSING
1.55 71088 MMI CONNECTION FAILURE
1.56 71091 OVERFLOW ALARM FROM EXTERNAL SYSTEM


List of Tables

Table 1 Valid and default attribute values of the NWI3 adapter configuration file


Id:0900d80580654f99

Summary of changes

Note that the issue numbering system is changing. For more information, see Guide to WCDMA RAN operating documentation. Changes between document issues are cumulative. Therefore, the latest document issue contains all changes made to previous issues.

Changes between issues 03A and 03B

• Modified alarms:

– 71050 OMS EMT CONNECTION COULD NOT BE OPENED

– 71051 OMS EMT CONTROL CONNECTION FAILURE

Changes between issues 3-0 and 03A


– 71052 OMS FILE TRANSFER CONNECTION COULD NOT BE OPENED

– 71088 MMI CONNECTION FAILURE

Changes between issues 2-2 and 3-0

• New alarms:

– 70280 UNKNOWN SPECIFIC PROBLEM


– Name of alarm 70267 has been changed to EXTERNAL USER ACCOUNT VALIDATION FAILED.


Id:0900d80580654be4

1 RNC OMS alarms

1.1 70001 CONFIGURATION OF SNMP MEDIATOR IS OUT OF ORDER

Probable cause: Corrupt data

Event type: Processing error

Default severity: Minor

Meaning

The configuration of the SNMP mediator contains values that are unacceptable.

The invalid part of the configuration is ignored. This causes partial loss of functionality, and SNMP traps may be lost.

Identifying additional information fields

Configuration entry

• The name and value of the attribute that is out of order under the fssnmpMediatorName=1, fsFragmentId=SNMP, fsClusterId=ClusterRoot branch.

Additional information fields

-

Instructions

Use the parameter management application to correct the configuration branch that is out of order. The Application Additional Information field displays the attribute or entry name that has an unacceptable value. For example, the following entry causes the alarm 70001, if xxx is not a hostname that can be resolved:

fssnmpNEId=xxx,fssnmpAttributeType=NEattrs,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot

The Testing instructions section below describes how to create the invalid entry.
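The entry name in the example above is an LDAP distinguished name that reads from the most specific component to the cluster root. As a rough illustration only (it is not part of the product tooling), the following Python sketch splits such a name into attribute and value pairs so that the offending attribute is easy to spot. The function name split_dn is hypothetical, and the sketch assumes that values contain no escaped commas.

```python
from typing import List, Tuple


def split_dn(dn: str) -> List[Tuple[str, str]]:
    """Split a distinguished name into (attribute, value) pairs.

    Assumption: values contain no escaped commas or equals signs,
    which holds for the entry names quoted in this document.
    """
    pairs = []
    for component in dn.split(","):
        attribute, _, value = component.strip().partition("=")
        pairs.append((attribute, value))
    return pairs


dn = ("fssnmpNEId=xxx,fssnmpAttributeType=NEattrs,"
      "fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot")

# The first pair is the most specific component, here the invalid hostname.
for attribute, value in split_dn(dn):
    print(f"{attribute} = {value}")
```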

Clearing

The alarm is cleared automatically by the alarm system after five minutes. If the configuration is still out of order after that, the alarm is raised again.

Testing instructions

1. Open the parameter management application and use it in the extended mode (select Browse > Mode > Extended Mode).

2. Add an invalid hostname to SNMP mediator’s LDAP configuration:

a) Expand the entry tree below fsFragmentID=SNMP: In the parameter management application main window, click the arrow next to the SNMP fragment in the entry tree (fsFragmentID=SNMP).

b) Click the arrow next to fssnmpMediatorName=1 to further expand the entry tree.

c) Select fssnmpAttributeType=NEattrs and click the arrow next to it to display the managed NEs.


d) Select Entry > New Child or right-click fssnmpAttributeType=NEattrs and select New Child.

e) In the Add new entry dialog box, enter any value for attribute fssnmpMOID and value xxx for fssnmpNEId.

f) Click OK and select Forced Activation in the Select Operation window.

3. Restart /SNMPMediator.

Alarm 70001 with IAAI="fssnmpNEId=xxx" is raised.


Id:0900d805802d4c1b

1.2 70002 INVALID SNMP TRAP COMMUNITY STRING

Probable cause: Corrupt data


Default severity: Warning

Meaning

The SNMP Mediator has received an SNMP trap that contains an invalid trap community string, that is, the community string in the trap does not match the community string in SNMP Mediator's configuration. The community strings are passwords that are used to authenticate the senders of SNMP traps.

Identifying additional information fields

-

Additional information fields

1. IP address of the SNMP agent that sent the trap

2. The received trap community string

3. Version of the used SNMP, possible values are:

• SNMPv1
• SNMPv2c

4. Object identifier of the received trap

Instructions

1. Check the IP address of the SNMP agent that sent the trap. The IP address is displayed in the Identifying Additional Information field #1 of the alarm.

2. Check the community string that was received in the trap. The community string is displayed in the Application Additional Information field #1 of the alarm.

3. Use the parameter management tool to check the community string that the SNMP Mediator expects. Attribute fssnmpCommunityString of the following entry defines the community string:

fssnmpTrapSource=<agent ip / hostname>,fssnmpAttributeType=Commstrings,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot

4. Modify the community string in the LDAP directory to match the community string received in the trap, or configure the SNMP agent to use the community string that the SNMP Mediator expects. Note that if no community string has been specified for an IP address in the LDAP, the SNMP Mediator accepts all community strings from that address.
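The acceptance rule described in step 4 can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this section, not the SNMP Mediator's implementation; the function name and the dictionary that stands in for the LDAP configuration are hypothetical.

```python
from typing import Dict


def community_is_valid(configured: Dict[str, str], source: str,
                       received: str) -> bool:
    """Return True if a trap from `source` passes the community check.

    `configured` maps a trap source (IP address or hostname) to the
    expected community string. It is a stand-in for the
    fssnmpCommunityString attributes stored in LDAP.
    """
    expected = configured.get(source)
    if expected is None:
        # No community string specified for this source:
        # all community strings from that address are accepted.
        return True
    return received == expected


configured = {"CLA-0": "secret"}
print(community_is_valid(configured, "CLA-0", "public"))     # mismatch
print(community_is_valid(configured, "CLA-0", "secret"))     # match
print(community_is_valid(configured, "10.0.0.9", "public"))  # unconfigured
```

A mismatch in the first call corresponds to the condition that raises alarm 70002.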

Clearing

Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.


Testing instructions

1. Open the parameter management application and use it in normal mode, when SNMP Mediator is running.


2. Define the trap community for address CLA-0 to be "secret" by adding the following entry to SNMP mediator's LDAP configuration:

dn: fssnmpTrapSource=CLA-0,fssnmpAttributeType=Commstrings,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot
fssnmpCommunityString: secret
fssnmpTrapSource: CLA-0
objectClass: FSSNMPTrapCommunityString
objectClass: top
objectClass: FSMOCBase

3. Log into CLA-0.

4. Send a trap to SNMP Mediator with the following command:

# snmptrap -v 1 -c public SNMPMediator "" <CLA-0 IP address> 0 0 ""

Alarm 70002 INVALID SNMP TRAP COMMUNITY STRING with IAAI=<CLA-0 IP address> and AAI="public SNMPv1 .1.3.6.1.6.3.1.1.5.1" is raised.
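When checking the raised alarm, it can help to split the AAI value shown above into its three parts. The following Python sketch assumes, based only on this one example, that the AAI is a whitespace-separated string of community string, SNMP version and trap object identifier; the function name parse_aai and the dictionary keys are hypothetical.

```python
from typing import Dict


def parse_aai(aai: str) -> Dict[str, str]:
    """Split an alarm 70002 AAI string into its three parts.

    Assumption: the last two whitespace-separated tokens are the SNMP
    version and the trap OID; everything before them is the community
    string (so a community string containing spaces is kept whole).
    """
    community, version, oid = aai.rsplit(None, 2)
    return {"community": community, "version": version, "trap_oid": oid}


fields = parse_aai("public SNMPv1 .1.3.6.1.6.3.1.1.5.1")
print(fields["community"], fields["version"], fields["trap_oid"])
```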


Id:0900d805803b05c7

1.3 70003 NO REPLY TO SNMP REQUEST

Probable cause: Corrupt data



Meaning

SNMP Mediator has sent an SNMP request to an SNMP agent but it has not received a response.

Example:

1. A filter condition has been added for the authenticationFailure 1.3.6.1.6.3.1.1.5.5 trap. Thus the following entry can be viewed by the parameter management tool:

fssnmpV2TrapId=.1.3.6.1.6.3.1.1.5.5,fssnmpAttributeType=V2traps,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot

The filter condition is defined by the attribute fssnmpFilterCondition. fssnmpFilterCondition may have, for example, the value (.1.3.6.1.2.1.1.1.0=*Linux*). See RFC 2254 for more information about the filter syntax.

2. The SNMP Mediator receives the authenticationFailure trap that does not contain the value of variable .1.3.6.1.2.1.1.1.0.

3. The SNMP Mediator queries the value of .1.3.6.1.2.1.1.1.0 from the SNMP agent, but does not receive a response.

The SNMP Mediator is not able to handle the trap correctly, because it is not able to query or modify variables in the SNMP agent.


Additional information fields

IP address of the SNMP agent that does not answer

Instructions

1. Check the IP address of the SNMP agent that sent the trap. The IP address is displayed in the Application Additional Information field #1 of the alarm.

2. The net-snmp command line tools (snmpget, snmpset and so on) provided by the operating system may be used to verify the functionality of the SNMP agent.

3. To check the attributes defined for the SNMP agent, use the parameter management tool. The attributes are located under the following entry:

fssnmpNEId=<agent IP / hostname>,fssnmpAttributeType=NEattrs,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot

4. Verify that the optional attribute fssnmpUDPPort has the value that the SNMP agent is listening to. The default value is 161.


5. Verify that the optional attribute fssnmpProtocolVersion matches the protocol version that the SNMP agent supports. The default value is V2c.

6. Verify that the optional attributes fssnmpReadCommString and fssnmpWriteCommString are the ones that the SNMP agent expects.



Testing instructions

1. Open the parameter management tool and use it in normal mode, when SNMP Mediator is running.

2. Add entry "fssnmpV2TrapId=.1.3.6.1.6.3.1.1.5.1" under branch "fssnmpAttributeType=V2traps,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot"

3. Add attribute fssnmpFilterCondition to the entry created in step 2 and give it the value (.1.3.6.1.2.1.1.5.0=anystring) (The grammar for the filter condition is specified in http://www.ietf.org/rfc/rfc2254.txt?number=2254)

4. Verify that there is no SNMP agent process such as snmpd running on CLA-0.

# netstat -alp | grep snmp
tcp 0 0 *:smux *:* LISTEN 11017/snmpd
udp 0 0 *:snmp *:* 11017/snmpd
# kill 11017
root@CLA-0(GUI):~# netstat -alp | grep snmp
#

5. Send a trap to SNMP Mediator with the following command (use the IP address of CLA-0 as agent IP):

# snmptrap -v 1 -c public SNMPMediator "" 192.168.128.1 0 0 ""

Alarm 70003 NO REPLY TO SNMP REQUEST is raised with AAI=192.168.128.1, because

• SNMP Mediator receives trap ".1.3.6.1.6.3.1.1.5.1", which does not contain the variable ".1.3.6.1.2.1.1.5.0" that is part of the filter condition.

• SNMP Mediator tries to get the value of ".1.3.6.1.2.1.1.5.0" from an SNMP agent running in address 192.168.128.1.

• SNMP Mediator does not get a response from 192.168.128.1, because no SNMP agent is running in the address.
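The reasoning in the list above can be sketched in Python. The sketch handles only the simplest RFC 2254 form used in this section, a single (attribute=value) item with optional * wildcards, and it only illustrates the decision, not how the SNMP Mediator is implemented. The function names and the returned strings are hypothetical.

```python
import re
from typing import Dict, Tuple


def parse_simple_filter(condition: str) -> Tuple[str, str]:
    """Parse a filter of the form (oid=pattern) into (oid, pattern)."""
    match = re.fullmatch(r"\(([^=()]+)=([^()]*)\)", condition.strip())
    if match is None:
        raise ValueError("only simple (oid=pattern) filters are supported")
    return match.group(1), match.group(2)


def evaluate(condition: str, varbinds: Dict[str, str]) -> str:
    """Evaluate a simple filter against the variables carried by a trap.

    Returns "query agent" when the trap lacks the variable named in the
    filter. That is the case in which a request is sent to the agent and
    alarm 70003 follows if the agent does not answer.
    """
    oid, pattern = parse_simple_filter(condition)
    if oid not in varbinds:
        return "query agent"
    # Translate the * wildcard into a regular expression.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if re.fullmatch(regex, varbinds[oid], re.DOTALL):
        return "match"
    return "no match"


# The test trap carries no variables, so the value must be queried.
print(evaluate("(.1.3.6.1.2.1.1.5.0=anystring)", {}))
```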


Id:0900d805802d470b

1.4 70004 UNKNOWN SNMP TRAP

Probable cause: Corrupt data



Meaning

The SNMP Mediator has received an SNMP trap that it is unaware of. The trap is unknown to the SNMP Mediator, if 1) the IP address of the SNMP agent that sends the trap is missing from the SNMP Mediator's configuration, or 2) the OID (object identifier) of the trap is unknown to the SNMP Mediator.

1. Unknown traps may contain information that could be useful.

2. Unnecessary traps waste network capacity.
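The two conditions in the Meaning text can be written as a small decision function. The following Python sketch is illustrative only; the function name and the two sets standing in for the SNMP Mediator's configuration and trap rules are hypothetical.

```python
from typing import Optional, Set


def unknown_trap_reason(agent_ip: str, trap_oid: str,
                        configured_agents: Set[str],
                        known_trap_oids: Set[str]) -> Optional[str]:
    """Return why a trap counts as unknown, or None if it is known."""
    if agent_ip not in configured_agents:
        return "agent IP address is missing from the configuration"
    if trap_oid not in known_trap_oids:
        return "trap OID is unknown"
    return None


# Hypothetical configuration: one known agent, one known trap OID.
agents = {"192.168.128.1"}
oids = {".1.3.6.1.6.3.1.1.5.1"}

print(unknown_trap_reason("127.0.0.1", ".1.3.6.1.6.3.1.1.5.1", agents, oids))
```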


Additional information fields

1. IP address of the SNMP agent that sent the trap

2. Version of the used SNMP, possible values:

• SNMPv1
• SNMPv2c

3. Object identifier of the received trap

Instructions

1. Using the parameter management application, check that the IP address of the SNMP agent is stored in the SNMP Mediator's configuration. An entry of the following format should be found:

fssnmpNEId=<agent IP or hostname>,fssnmpAttributeType=NEattrs,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot

2. If the trap is unnecessary, check whether there is a way to disable the sending of the trap in the SNMP agent or use filtering in the SNMP Mediator. The SNMP Mediator may be configured to filter out traps by adding an entry of the following format:

fssnmpV2TrapId=<trap OID>,fssnmpAttributeType=V2traps,fssnmpMediatorName=1,fsFragmentId=SNMP,fsClusterId=ClusterRoot

If the above entry without attributes exists in the configuration, the SNMP Mediator will ignore the trap and no alarm is raised. Additionally, filtering attributes fssnmpAcceptFrom or fssnmpDiscardFrom may be used to define the IP addresses from where the trap should be accepted or ignored. Attribute fssnmpFilterCondition may be used for filtering away traps based on variables within the trap itself. See RFC 2254 for information about the filter syntax ("approx", "extensible" and "escaping mechanism" are not supported).

3. If the trap contains important information, the implementation of the SNMP Mediator should be updated. The rules that define what the SNMP Mediator does when it receives traps are part of the implementation. Fill in a problem report and send it to your local Nokia Siemens Networks representative.



Testing instructions

1. Log into the active CLA.

2. Send coldStart trap to SNMP Mediator by using agent IP that is not in SNMPMediator's configuration (127.0.0.1):

# snmptrap -v 1 -c public SNMPMediator "" 127.0.0.1 0 0 ""

3. Alarm 70004 UNKNOWN SNMP TRAP with AAI=127.0.0.1 and AAI="SNMPv1 .1.3.6.1.6.3.1.1.5.1" is raised.


Id:0900d805803f7d2d

1.5 70005 INCORRECT ALARM DATA

Probable cause: Invalid parameter


Default severity: Major

Meaning

The alarm system has been requested to raise or clear an alarm with incorrect alarm data. One or more arguments provided with the request might have an invalid value or meaning:

• null
• empty
• too long
• out of specified range
• contain non-printable characters
• have an incorrect format

The alarm number (Specific Problem) might also be unknown. An incorrect format in this case means, for example, that a character value was entered where a numeric value was expected. A special case of an incorrect format is if the quotes (") surrounding the value of an information field are missing from an alarm notification record in the syslog.

The alarm which is requested to be raised or cleared with incorrect data is not processed further but the information is put as additional information in this alarm. If the alarm number is unknown, then the actual fault for which the alarm has been raised is also left unknown.

Identifying additional information fields

1. Erroneous data

• Identifies the alarm data that was incorrect or that was totally missing. Only the name of the first field containing invalid data is mentioned here. Possible values are:

– SP: Specific Problem given in the data is not known by the alarm system, or is not reasonable;
– MOId: Managed Object Id given in the data is not reasonable;
– PS: Perceived Severity given in the data is not reasonable;
– applId: Application Id given in the data is not reasonable;
– AAI: Additional Information given in the data is not reasonable;
– IAAI: Identifying Additional Information given in the data is not reasonable;
– alarmTime: Alarm time is presented in too long a format, or is in non-numerical format;
– length: The combined length of the string type fields (Managed Object Id, Application Id, Application Additional Information, Identifying Application Additional Information) given in the data exceeds the maximum value of 896 characters. Note that in this case, both Application Id and Managed Object Id in the given data are considered as invalid, as only the combined length is verified.

• In addition, these values are also possible for RNC alarms:

– rncLocalMOId: the Local Managed Object Id given in the data is not reasonable;
– rncApplicationId: the RNC Application Id given in the data is not reasonable;
– rncNotificationId: the RNC Notification Id given in the data is not reasonable;
– rncFlowControl: the RNC Flow Control given in the data is not reasonable.

2. Specific Problem

• Specific problem (the alarm number) of the invalid alarm can also contain the original invalid value if this was the invalid field.
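To make the Erroneous data values more concrete, the following Python sketch checks a subset of them: an unknown Specific Problem, an empty or non-printable Managed Object Id or Application Id, and the 896-character combined length limit. It is a sketch under stated assumptions, not the alarm system's validation code. The dictionary layout, the function name and the order of the checks are hypothetical; the document only says that the first invalid field is reported.

```python
from typing import Dict, Optional, Set

# Maximum combined length of MOId, applId, AAI and IAAI (from the text).
MAX_COMBINED_LENGTH = 896


def first_erroneous_field(alarm: Dict[str, object],
                          known_sps: Set[int]) -> Optional[str]:
    """Return the name of the first invalid field, or None if all pass.

    Assumption: the check order (SP, MOId, applId, length) is chosen for
    illustration; the real order is not given in this document.
    """
    if alarm.get("SP") not in known_sps:
        return "SP"
    for name in ("MOId", "applId"):
        value = alarm.get(name)
        if not isinstance(value, str) or not value or not value.isprintable():
            return name
    combined = sum(len(str(alarm.get(name) or ""))
                   for name in ("MOId", "applId", "AAI", "IAAI"))
    if combined > MAX_COMBINED_LENGTH:
        return "length"
    return None


# Hypothetical request mirroring the flexalarm test with -sp=700111.
request = {"SP": 700111, "MOId": "myMO", "applId": "myAP"}
print(first_erroneous_field(request, known_sps={70001, 70005}))
```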

Additional information fields

Managed Object Id

• Distinguished name of the managed object that was given as the Managed Object Id in the invalid alarm. If the MOId itself was the incorrect data, then the value fsManagedObjectId=invalid, fsClusterId=ClusterRoot is displayed in this field.

Instructions

Fill in a problem report and send it to your local Nokia Siemens Networks representative.

Clearing

Clear the alarm with the alarm management application after correcting the fault as presented in Instructions, in other words, after sending the report to your local Nokia Siemens Networks representative.

Testing instructions

Use, for example, the alarm system command line interface (CLI) command flexalarm to send a request to raise or clear an alarm with a Specific Problem that does not exist.

For example:

$> flexalarm -raise -mo=<myMO> -ap=<myAP> -sp=700111

where <myMO> and <myAP> have the correct format.

Since the 700111 Specific Problem does not exist, alarm 70005 is raised.


Id:0900d805803c315d

1.6 70007 AUTHENTICATION FAILURE IN ETHERNET DEVICE

Probable cause: Protection path failure

Event type: Equipment


Meaning

An Authentication Failure SNMP trap signifies that the sending protocol entity is the addressee of a protocol message that is not properly authenticated. The agent generates this trap on an authentication failure, that is, when some actor sends SNMP queries with wrong authentication methods or keys. The authentication key is called the community string in SNMP. The cause is most likely a misconfigured SNMP manager or MIB browser, but the trap may also indicate malicious activity, that is, a malicious user trying to obtain information by sending an SNMP request. The trap is not triggered by CLI (Command Line Interface) or Web login failures.

The SNMP request will fail and no information will be returned.

Identifying additional information fields

IP address

• The IP address of the entity that used the wrong community string and thereby caused the trap.


Instructions

If there are no misconfigured SNMP managers, there is a danger that some entity is inside the network without authorization, and this actor must be found. The entity can be identified from the authentication failure SNMP trap sent by the SNMP agent.

If the SNMP configuration in the manager is incorrect, the SNMP community string must be updated.



Testing instructions

1. Log into the switch. For example:

[root@CLA-0(MIKAEL_R_FSPR4EDC_1.9) /root]# ssh switch-1
Linux swsea 2.4.17_mvl21-swsea #1 Wed May 17 11:59:44 CDT 2006 ppc unknown
Linux swsea 2.4.17_mvl21-swsea #1 Wed May 17 11:59:44 CDT 2006 ppc unknown

2. Start the swc command line tool:

root@swsea@1-1-8:~# swc
(RadiSys SWSE-A Switch) >

3. Display the community strings by "show snmpcommunity":

(RadiSys SWSE-A Switch) >show snmpcommunity

SNMP Community Name   Client IP Address   Client IP Mask   Access Mode   Status
tstcomm               192.168.128.1       0.0.0.0          Read Only     Enable
com                   192.168.128.1       0.0.0.0          Read Only     Enable

4. Exit the switch:

(RadiSys SWSE-A Switch) >quit
The system has unsaved changes.
Would you like to save them now? (y/n) n
root@swsea@1-1-8:~# exit
logout
Connection to switch-1 closed.

5. Perform an SNMP Get request with a valid community string:

# snmpget -c tstcomm -v 2c switch-1 system.sysDescr.0
SNMPv2-MIB::sysDescr.0 = STRING: RadiSys SWSE-A Switch

6. Perform an SNMP Get request with an invalid community string:

# snmpget -c invalid -v 2c switch-1 system.sysDescr.0
SNMPv2-MIB::sysDescr.0 = STRING: RadiSys SWSE-A Switch

Alarm 70007 will be raised after step 6 due to the invalid community string.


Id:0900d8058043d853

1.7 70011 NODE NOT RESPONDING

Probable cause: Equipment malfunction



Meaning

A physical computing node has not restarted despite restart attempts. The node may be broken, unable to restart, or stuck.

Any important services/functions that are provided with an active-standby recovery group may have been taken over by other operational nodes. Services may be down if standby nodes are also down.


Additional information fields

Any further information if available.

Instructions

Perform the following steps to verify the state of the node:

1. Log into the cluster as root user.

2. Use the hwcli command to verify the state of the node. For example, the state of the node /CLA-0 can be checked as follows:

$ hwcli CLA-0
CLA-0: available (FlexiSvr CPI1 000157:0108 01.02)

3. Previous hwcli output shows that the CLA-0 node is physically available. The high availability services (HAS) of the system attempt, after about 30 minutes, to restart a failed node by issuing a power-off, power-on and restart sequence. If you do not want to wait for this, you can perform the power-off, power-on and restart sequence manually. For example:

$ hwcli --power off CLA-0
ATTEMPTING TO POWER OFF NODE CLA-0
ARE YOU SURE YOU WANT TO PROCEED? yes
Powering off CLA-0: OK
$ hwcli --power on CLA-0
Powering on CLA-0: OK
$ hwcli --reset CLA-0
ATTEMPTING TO RESET NODE CLA-0
ARE YOU SURE YOU WANT TO PROCEED? yes
Resetting CLA-0: OK

4. If the node does not start within a few minutes or the hwcli does not show that the node is available, check if the CPU board has any error lights on. If it does, you can try to restore the node into service by removing and re-inserting the node.


5. Contact your Nokia Siemens Networks representative even if these operations bring the node up, because it is possible that the computing node needs to be replaced or it may, for example, need a BIOS upgrade.

Clearing

The system clears the alarm automatically when the fault has been corrected.


Testing instructions

1. Power off an operational unlocked node using hwcli. You can check the state of the node using fshascli. For example,

$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED) <== Unlocked
operational(ENABLED) <== Operational
usage(IDLE)
procedural()
availability()
unknown(FALSE)
alarm()
$ hwcli --power off AS-1
ATTEMPTING TO POWER OFF NODE AS-1
ARE YOU SURE YOU WANT TO PROCEED? yes
Powering off AS-1: OK

2. Wait for the node to change its state to DISABLED. By default, the alarm is raised about 10 minutes after the node has been declared faulty because attempts to restart it have failed. A faulty node has OFFLINE and FAILED in the availability status. For example,

$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED) <== Unlocked
operational(DISABLED) <== Not operational
usage(IDLE)
procedural(INITIALIZING)
availability(OFFLINE) <== Not yet failed
unknown(FALSE)
alarm(MAJOR,OUTSTANDING)
$ sleep 11m
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED) <== Unlocked
operational(DISABLED) <== Not operational
usage(IDLE)
procedural(NOTINITIALIZED)
availability(OFFLINE,FAILED) <== FAILED!
unknown(FALSE)
alarm()

The alarm raising is also visible in the syslog as a message that begins as follows:

ALARM RAISE SP=70011 . . .


3. The alarm is automatically cancelled when the node has successfully restarted. Issue a power-on for the node using hwcli and wait for the node restart to complete. For example,

$ hwcli --power on AS-1
Powering on AS-1: OK
$ sleep 3m
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED) <== Unlocked
operational(ENABLED) <== Operational
usage(IDLE)
procedural()
availability()
unknown(FALSE)
alarm()

The alarm cancellation is also visible in the syslog as a message that begins as follows:

ALARM CANCEL SP=70011 . . .
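When the test is repeated often, the state listing printed by fshascli can be checked by a script instead of by eye. The following Python sketch parses the name(value,...) lines shown in the examples of this section and reports whether the availability status contains both OFFLINE and FAILED. It assumes the output has exactly the layout printed here; the function names are hypothetical and the sketch does not call fshascli itself.

```python
import re
from typing import Dict, List


def parse_state(output: str) -> Dict[str, List[str]]:
    """Parse lines such as 'availability(OFFLINE,FAILED) <== FAILED!'."""
    states: Dict[str, List[str]] = {}
    for line in output.splitlines():
        match = re.match(r"\s*(\w+)\(([^)]*)\)", line)
        if match:
            values = [v for v in match.group(2).split(",") if v]
            states[match.group(1)] = values
    return states


def node_has_failed(output: str) -> bool:
    """A faulty node has both OFFLINE and FAILED in availability."""
    availability = parse_state(output).get("availability", [])
    return "OFFLINE" in availability and "FAILED" in availability


sample = """/AS-1
administrative(UNLOCKED) <== Unlocked
operational(DISABLED) <== Not operational
usage(IDLE)
procedural(NOTINITIALIZED)
availability(OFFLINE,FAILED) <== FAILED!
unknown(FALSE)
alarm()
"""
print(node_has_failed(sample))
```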


Id:0900d8058038eeed

1.8 70025 POSSIBLE SECURITY THREAT IN NETWORK ELEMENT

Probable cause: Threshold crossed

Event type: Quality of Service


Meaning

There is reason to suspect that someone is trying to intrude into a network element. This condition arises if there are too many failed login attempts.



Instructions

Check the security log data. Investigate especially the login entries made just before the alarm was raised.

Clearing

After correcting the fault as presented in Instructions, clear the alarm with the alarm management application.

Testing instructionsPrerequisites for the testing: Make an internal test account (i.e., to reside in the network element's LDAP server by using either the parameter management application or the fsuseradd CLI command) and set its password.

1. Log into a node with ssh and with a valid user account and password so that a session is successfully started.

2. Log out from the node.

3. Log in with the same user account but with a wrong password the predefined number of times. The number is defined in the file /etc/pam.d/ssh, on the row "/opt/Nokia_BP/lib/security/$ISA/PamAlarm.so file=/var/log/faillog alarmThreshold=<number> validfor=internal", with the parameter alarmThreshold=<threshold_for_number_of_failed_logins>. The default number of consecutive failed logins needed is 5. Make sure that there are no successful logins for the user between the failed ones.

An alarm should be raised after the predefined number of failed logins. Check the alarm list with the alarm management application.

Tip: You can also use Element Manager instead of ssh for the test.
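
To find out how many failed logins the test needs, the threshold can be read from the PAM configuration. The following Python sketch extracts alarmThreshold from the PamAlarm.so row described in step 3. The function name is hypothetical, and the leading "auth required" fields in the sample row are an assumption about the usual PAM row layout; only the module path and its parameters come from this document.

```python
import re

# Matches the alarmThreshold parameter on a row that loads PamAlarm.so.
THRESHOLD = re.compile(r"PamAlarm\.so\b.*?\balarmThreshold=(\d+)")


def alarm_threshold(pam_config_text):
    """Return alarmThreshold from the PamAlarm.so row, or None if not found."""
    for line in pam_config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out rows
        match = THRESHOLD.search(line)
        if match:
            return int(match.group(1))
    return None


# Sample row: "auth required" is assumed, the rest is quoted from step 3.
SAMPLE_PAM = (
    "auth required /opt/Nokia_BP/lib/security/$ISA/PamAlarm.so "
    "file=/var/log/faillog alarmThreshold=5 validfor=internal\n"
)
print(alarm_threshold(SAMPLE_PAM))
```

The helper returns None rather than guessing a value when the row is missing, so the caller can decide whether to fall back to the documented default of 5.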

1.9 70030 DISK DATABASE IS GETTING FULL

Probable cause: Storage capacity problem



Meaning

The disk storage area reserved for the disk database is filling up.

The disk database is still fully operational. If the database fills up completely, its services cannot be properly used anymore.



1. Max size: the maximum size of the database in kB

2. Fill ratio: the fill ratio of the database (the percentage of the database that is filled)

Instructions

The actions needed to avoid a completely full database are database-specific, so contact your local Nokia Siemens Networks representative immediately and provide them with the information you obtained from the alarm notification's fields.

Testing instructions

You can test the alarm either by filling the database until the allocated space exceeds the fill ratio alarm limit, or by decreasing the fill ratio alarm limit below the current fill ratio of the database. You can also combine these two approaches.

• In the first approach, you simply create a dummy table in the database and insert rows into it until the fill ratio exceeds the fill ratio alarm limit (see attribute fsdbFillRatioAlarmLimit in the DB fragment in LDAP, Lightweight Directory Access Protocol).

• In the second approach, you must use a parameter management tool to change the fsdbFillRatioAlarmLimit attribute of the DB fragment to a value smaller than the current fill ratio of the database. After this, you must restart the recovery group of the database (fshascli -r /<RG>). The current fill ratio of the database can be estimated as follows:

1. Get the maximum size of the database either by checking the innodb_data_file_path attribute from the MySQL instance configuration file (/var/mnt/local/MySQL_<DBName>/my.cnf) or by connecting to the instance and entering the following command:

SHOW GLOBAL VARIABLES LIKE 'innodb_data_file_path'\G

The maximum size is the sum of the maximum size of each InnoDB data file listed in the value. For example, the following result means that the maximum size is 500 MB (512,000 kB):

*************************** 1. row ***************************
Variable_name: innodb_data_file_path
Value: ibdata1:500M

2. Get the free space of the database by connecting to the instance and entering the following command for any InnoDB table:

SHOW TABLE STATUS FROM <schema> LIKE '<table>'\G

where <schema> is the schema name of the InnoDB table and <table> is the name of the table. The Comment column of the result set shows the free space. For example, the following result means that the database has 492,544 kB of free space (with the example size of step 1, this gives a fill ratio of 3.8%):

mysql> SHOW TABLE STATUS FROM test LIKE 'mysqlwdtest'\G
*************************** 1. row ***************************
Name: mysqlwdtest
...
Comment: InnoDB free: 492544 kB

It does not matter which InnoDB table is used in the query.

3. To find the schema and the name of an arbitrary InnoDB table for the query in step 2, use the following query:

SELECT table_schema,table_name FROM information_schema.tables WHERE engine = 'InnoDB' LIMIT 1;
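
The fill ratio estimate above is simple arithmetic: (maximum size - free space) / maximum size. The following Python sketch reproduces it from the two values returned by MySQL. The function names are hypothetical. The sketch assumes that every size carries a K, M or G suffix, and it treats a data file with autoextend but no max limit as having no defined maximum size.

```python
import re

KB_PER_UNIT = {"K": 1, "M": 1024, "G": 1024 * 1024}


def size_to_kb(size):
    """Convert a size such as "500M" to kB."""
    match = re.fullmatch(r"(\d+)([KMG])", size.strip().upper())
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    return int(match.group(1)) * KB_PER_UNIT[match.group(2)]


def max_size_kb(innodb_data_file_path):
    """Sum the maximum size of every data file in innodb_data_file_path."""
    total = 0
    for spec in innodb_data_file_path.split(";"):
        fields = spec.strip().split(":")
        options = [field.lower() for field in fields[2:]]
        if "max" in options:
            # name:size:autoextend:max:limit -> the limit is the maximum size
            total += size_to_kb(fields[2 + options.index("max") + 1])
        elif "autoextend" in options:
            raise ValueError(f"{fields[0]} can grow without an upper limit")
        else:
            total += size_to_kb(fields[1])
    return total


def free_kb(comment):
    """Extract the free space from a Comment value such as "InnoDB free: 492544 kB"."""
    match = re.search(r"InnoDB free:\s*(\d+)\s*kB", comment)
    if match is None:
        raise ValueError(f"no InnoDB free space in {comment!r}")
    return int(match.group(1))


def fill_ratio_percent(max_kb, free):
    """Return the fill ratio of the database in percent."""
    return 100.0 * (max_kb - free) / max_kb


maximum = max_size_kb("ibdata1:500M")
free = free_kb("InnoDB free: 492544 kB")
print(maximum, free, round(fill_ratio_percent(maximum, free), 1))
```

With the example values of steps 1 and 2, the sketch gives 512000 kB, 492544 kB and a fill ratio of 3.8, which matches the figures in the text.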

1.10 70064 BACKUP ERROR

Probable cause: Application subsystem failure



Meaning

Backup has failed because of a fatal error or it has been interrupted.

As a result, either the backup archive does not exist or it is corrupted and unusable.

Identifying additional information fields

1. Backup log file. Identifies the name of the backup log file without the path.

The format is $BUTYPE_$BASE_$DATE, where $BUTYPE is either "FULL", "PARTIAL" or "CUSTOM", $BASE is the name of the base delivery or the hostname (if the flexiserver link is not present in the system), and $DATE is the current date in the format YYYYMMDD_HHMMSS.
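
The log file name can be split back into its three parts when alarms are processed automatically. The following Python sketch does this under the format described above. The function name and the sample file names are hypothetical. Because the base name may itself contain underscores, the sketch matches the date at the end of the name and treats everything between the backup type and the date as the base.

```python
import re
from datetime import datetime

# <BUTYPE>_<BASE>_<YYYYMMDD_HHMMSS>; anything after the date is ignored.
LOG_NAME = re.compile(r"^(FULL|PARTIAL|CUSTOM)_(.+)_(\d{8}_\d{6})")


def parse_backup_log_name(name):
    """Return (backup type, base name, timestamp) for a backup log file name."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a backup log file name: {name!r}")
    butype, base, stamp = match.groups()
    return butype, base, datetime.strptime(stamp, "%Y%m%d_%H%M%S")


# Hypothetical file name, built from the documented format.
print(parse_backup_log_name("FULL_omshost_20091130_101500"))
```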


Instructions

1. Locate the backup log from /var/mnt/local/backup/SS_Backup. The name of the log file is given in the alarm.

2. See the backup summary at the end of the log.

3. Search the log contents for "ERROR" and "WARNING" statements to see which backup module has failed.

4. Refer to the backup and restore troubleshooting instructions.

5. If the backup has failed before the log file has been created, search the syslog for the latest fsbackup entries.

6. After the failure, re-execute the backup.

However, if the failure was caused by an incorrect environment and/or configuration, refer to the backup and restore troubleshooting instructions and correct the environment and/or configuration before re-executing the backup.

Clearing

Clear the alarm with an alarm management application after correcting the fault as presented in Instructions.

Testing instructions

1. Start a partial backup. For example:

fsbackup -p -v

2. Interrupt the process by pressing Ctrl-C.

The backup process raises an alarm.

Or

1. Lock a database recovery group (for example, TimesTen and Solid).

2. Execute a custom backup, for example:

fsbackup -d -v

The backup process raises an alarm.

1.11 70110 CONFIGURATION OF NWI3 ADAPTER IS OUT OF ORDER

Probable cause: Configuration or customizing error



Meaning

The configuration file of the NWI3 adapter contains invalid attribute values. Depending on the release, the configuration is stored only in files or in files and LDAP (Lightweight Directory Access Protocol).

The system ignores the invalid parameters and uses the default values or the closest acceptable value. For example, the value 2000 is greater than the highest acceptable value (1440) for heartbeatPeriod (see the table in the Instructions) and causes this alarm. In this case, 1440 would be used as the heartbeatPeriod.
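
The fallback behaviour can be illustrated with a short Python sketch that applies the valid ranges and defaults of the numeric attributes listed in Table 1 of the Instructions. The function name is hypothetical. The document only states that the default or the closest acceptable value is used; the sketch assumes that an out-of-range number is replaced by the closest bound (as in the heartbeatPeriod example) and that a non-numeric value is replaced by the default.

```python
# (minimum, maximum, default) in minutes, taken from Table 1.
RANGES = {
    "heartbeatPeriod": (0, 1440, 15),
    "reRegistrationPeriod": (15, 1440, 60),
    "registrationRetryBasePeriod": (5, 240, 15),
    "retryRandom": (5, 240, 5),
    "rePublicationPeriod": (1, 60, 3),
    "getPublicationServiceRetryPeriod": (1, 60, 15),
}


def effective_value(name, raw):
    """Return (value used, is_valid) for a configured attribute value."""
    low, high, default = RANGES[name]
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default, False  # assumption: non-numeric -> default
    if value < low:
        return low, False      # assumption: too small -> lower bound
    if value > high:
        return high, False     # too large -> upper bound, as in the text
    return value, True


print(effective_value("heartbeatPeriod", "2000"))
```

A value flagged as invalid is the case in which the alarm would be expected, with the attribute name in the Identifying additional information field.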

Identifying additional information fields

Attribute name: name of the attribute that has an invalid value

Additional information fields

File path: the path of the file that includes invalid attribute values; or LDAP branch: the LDAP branch that includes invalid attribute values

Instructions

1. Correct the invalid attribute value. The attribute name is displayed in the Identifying additional information field. The name of the configuration file is displayed in the Additional information field. The attributes that can cause this alarm are mainly stored in file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini or LDAP branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot. The valid as well as default values of these attributes are presented in the table below. The attribute names in LDAP are prefixed with fsnwi3.

Name | Type | Default
(fsnwi3)takeIntoUseNext | boolean: (0=false,1=true) in nwi3mdcorba.ini and (false,true) in LDAP | 0
(fsnwi3)registrationServiceIOR | string, a valid IOR to NetAct’s registration service | empty string
(fsnwi3)heartbeatPeriod | short: [0..1440] minutes, granularity: 1 minute | 15
(fsnwi3)reRegistrationPeriod | short: [15..1440] minutes, granularity: 1 minute | 60
(fsnwi3)registrationRetryBasePeriod | short: [5..240] minutes, granularity: 1 minute | 15
(fsnwi3)retryRandom | short: [5..240] minutes, granularity: 1 minute | 5
(fsnwi3)rePublicationPeriod | short: [1..60] minutes, granularity: 1 minute | 3
(fsnwi3)getPublicationServiceRetryPeriod | short: [1..60] minutes, granularity: 1 minute | 15

Table 1 Valid and default attribute values of the NWI3 adapter configuration file

2. This alarm can also be caused by the parameter mediatorSessionManagerIOR located in file /var/opt/Nokia/www/SessionManager_V1.ior. Restart the NWI3 adapter to generate mediatorSessionManagerIOR into SessionManager_V1.ior. In normal conditions, the restart generates the parameter with a valid value.

3. If the problem results from the parameter systemID in file /var/opt/Nokia/www/systemid.txt, the probable cause is that the file systemid.txt is missing. The value in systemID should be the same as in the file /etc/cluster-id. Copy /etc/cluster-id to /var/opt/Nokia/www/systemid.txt and restart the NWI3 adapter.

Clearing

Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.

Testing instructions

1. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini exists, set the following content to it (no value for registrationServiceIOR and takeIntoUseNext=1):

[DN:N3CF-1]
objectClassVersion=1
N3CFId=1
objectClass=N3CF
configurationActive=0
takeIntoUseNext=1
registrationServiceIOR=
registrationServiceUsername=Nemuadmin
registrationServicePassword=nemuuser
heartbeatPeriod=15
reRegistrationPeriod=60
registrationRetryBasePeriod=15
retryRandom=5
rePublicationPeriod=3
getPublicationServiceRetryPeriod=15
userLabel=

2. If branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot exists in the LDAP, use the parameter management application to create a new child for the branch. Enter the following attributes in the Add New Entry dialog:
• fsnwi3N3CFId=1
• takeIntoUseNext=1

3. Restart NWI3Adapter.

If file nwi3mdcorba.ini was modified in step 1, alarm 70110 with IAAI=registrationServiceIOR and AAI=/var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini is raised. If LDAP was modified in step 2, alarm 70110 with IAAI=fsnwi3registrationServiceIOR and AAI=fsnwi3N3CFId=1,fsFragmentId=mediator,fsFragmentId=NWI3,fsClusterId=ClusterRoot is raised.

1.12 70111 FAILED TO CREATE NETACT CONNECTION

Probable cause: Connection establishment error

Event type: Communications


Meaning

The NWI3 adapter failed to register to Nokia NetAct.

NetAct cannot subscribe to notifications or be used for managing the network element (NE) via NWI3.


Additional information fields

Depending on the release:

N3CFId: the naming attribute of the active N3CF instance in file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini; or the distinguished name of the active N3CF instance in LDAP.

Instructions

1. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini exists:

a) Make sure that the NetAct Registration Service IOR (parameter registrationServiceIOR) is filled in file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini and check the correctness of the IOR. The command printIOR <IOR> can be used for viewing the IP address and port included in the IOR.

b) Verify that there is a valid username (parameter registrationServiceUsername) and password (registrationServicePassword) to the registration service of NetAct in file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini.

c) Check the value of the takeIntoUseNext parameter in the nwi3mdcorba.ini file. The value of the parameter in an active section should be 1, and the value of the configurationActive parameter should also be 1. The system sets the value of the configurationActive parameter automatically to 1 when a parameter set is taken into use.

2. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini does not exist and the NWI3 adapter's configuration is stored under branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot in the LDAP:

a) Verify that there is an LDAP entry fsnwi3N3CFId=<id>,fsFragmentId=mediator, fsFragmentId=NWI3 with attribute fsnwi3takeIntoUseNext=true, which defines the active attribute set.

b) Make sure that the NetAct Registration Service IOR (attribute fsnwi3registrationServiceIOR) has been specified for the active set and check the correctness of the IOR. The command printIOR can be used for viewing the IP address and port included in the IOR.

c) If attributes fsnwi3NEAccountUsername and fsnwi3NEAccountPassword exist under branch fsFragmentId=security, fsFragmentId=NWI3, they are used for NetAct registration. Verify that they are valid.

d) If attributes fsnwi3NEAccountUsername and fsnwi3NEAccountPassword do not exist under branch fsFragmentId=security, fsFragmentId=NWI3, the initial username (attribute fsnwi3initialRegistrationUsername) and password (fsnwi3initialRegistrationPassword) defined in the active set are used for NetAct registration. Verify that they are valid.

3. Verify that NetAct is up and running and check the connection between the NE and NetAct. Ping NetAct from the node where the NWI3 adapter is running: ping -I <node's external IP address> <NetAct's IP (see step 1)>.

4. Check that the NetAct hostname is configured in the external domain name system (DNS) in use.
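
The file-based checks in steps 1a to 1c can be sketched offline in Python with the standard configparser module. The function name is hypothetical, and the sketch assumes that nwi3mdcorba.ini is a plain INI file with key=value rows, as in the testing examples of this document. It only checks that the values are present; whether the IOR and the credentials are actually valid must still be verified against NetAct.

```python
import configparser

REQUIRED = (
    "registrationServiceIOR",
    "registrationServiceUsername",
    "registrationServicePassword",
)


def check_active_sections(ini_text):
    """Return a list of problems found in sections with takeIntoUseNext=1."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the mixed-case parameter names
    parser.read_string(ini_text)
    active = [name for name in parser.sections()
              if parser[name].get("takeIntoUseNext") == "1"]
    problems = []
    if not active:
        problems.append("no section has takeIntoUseNext=1")
    for name in active:
        for key in REQUIRED:
            if not parser[name].get(key, "").strip():
                problems.append(f"{name}: {key} is empty or missing")
    return problems


# Content modelled on the testing example of alarm 70110 (empty IOR).
SAMPLE_INI = """[DN:N3CF-1]
N3CFId=1
configurationActive=0
takeIntoUseNext=1
registrationServiceIOR=
registrationServiceUsername=Nemuadmin
registrationServicePassword=nemuuser
"""
print(check_active_sections(SAMPLE_INI))
```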

Clearing

The alarm system clears the alarm automatically after the fault has been corrected.


Testing instructions

1. If file /var/opt/Nokia/SS_Nwi3Adapter/config/nwi3mdcorba.ini exists:

a) Set the following content to it (a valid registrationServiceIOR of a non-existent NetAct object, and takeIntoUseNext=1):

[DN:N3CF-1]
objectClassVersion=1
N3CFId=1
objectClass=N3CF
configurationActive=0
takeIntoUseNext=1
registrationServiceIOR=IOR:000000000000002449444c3a4e5749332f526567697374726174696f6e536572766963655f56313a312e3000000000010000000000000064000102000000000e3137322e32312e3232302e3631009c3f0000002400504d43000000040000000a2f4e65744163745253002020000000084e65744163745253000000025649530300000005000507017d00000000000000000000080000000056495300
registrationServiceUsername=Nemuadmin
registrationServicePassword=nemuuser
heartbeatPeriod=15
reRegistrationPeriod=60
registrationRetryBasePeriod=15
retryRandom=5
rePublicationPeriod=3
getPublicationServiceRetryPeriod=15
userLabel=

b) Verify that NetAct's registration service is not running in the IP address and port defined by registrationServiceIOR.

c) Restart NWI3Adapter.

Alarm 70111 with AAI=1 is raised.

2. If branch fsFragmentId=mediator, fsFragmentId=NWI3, fsClusterId=ClusterRoot exists in the LDAP:

a) Use the parameter management tool to create a new child for the branch. Enter the following attributes in the Add New Entry dialog:
• fsnwi3N3CFId=1
• takeIntoUseNext=1
• fsnwi3registrationServiceIOR=IOR:000000000000002449444c3a4e5749332f526567697374726174696f6e536572766963655f56313a312e3000000000010000000000000064000102000000000e3137322e32312e3232302e3631009c3f0000002400504d43000000040000000a2f4e65744163745253002020000000084e65744163745253000000025649530300000005000507017d00000000000000000000080000000056495300

b) Restart NWI3Adapter.

Alarm 70111 with AAI="fsnwi3N3CFId=1,fsFragmentId=mediator,fsFragmentId=NWI3,fsClusterId=ClusterRoot" is raised.

1.13 70156 DISK DATABASE WATCHDOG START-UP FAILED

Probable cause: Configuration or Customizing Error


Default severity: Critical

Meaning

Start-up of the disk database watchdog has failed due to a configuration error or for other reasons.

Because the disk database and its watchdog belong to the same Recovery Unit (RU), the disk database watchdog start-up failure means that the database is not available.



1. Reason. Possible values:
• Disk database watchdog failed to read the parameters from the parameter management system.
• Invalid or missing parameter value.

2. List of invalid or missing parameters if the reason for the alarm is 2.

Instructions

Check the Application Additional Information field for a reason for the configuration error:

• Reason 1: Disk database watchdog failed to read the parameters from parameter management

• Reason 2: Invalid or missing parameter value

Continue according to the following procedure:

1. Check that the following parameters exist in parameter management for each database entry in the database fragment with the DN (Distinguished Name) "fsFragmentId=DB, fsClusterId=ClusterRoot":

fsdbRedundancyModel
fsdbDataSourceName
fsdbFillRatioAlarmLimit
fsdbFillRatioCheckFreq

2. Use the parameter management application to get the values of those parameters for the database in question. To find those parameters, use the value of the Managed Object field in the alarm management application, for example:

fsdbName=DB_Alarm,fsFragmentId=DB,fsClusterId=ClusterRoot

3. Send the found values and/or the parameters that do not exist (parameters for which the fields are empty) to your local Nokia Siemens Networks representative.

Clearing

Clear the alarm using the alarm management application after correcting the fault.


Testing instructions

1. Use a parameter management application to change the fsdbFillRatioAlarmLimit or fsdbFillRatioCheckFreq attribute of the database to a non-numeric value.

2. Restart the recovery group of the database.

1.14 70157 CPU USAGE OVER LIMIT

Probable cause: Threshold crossed

Event type: Quality of service


Meaning

A processor is being used at a very high throughput level because the execution of some processes is taking a lot of CPU time.

There is a risk that the node is unable to fulfill the tasks allocated to it. This depends on the extent to which the processes taking most of the CPU time are blocking other processes from getting runtime on the CPU, and on whether the increase in throughput is temporary or permanent.

If the processor is constantly used at a very high throughput level, the system might appear very slow. For example, the execution of commands takes an unusually long time to finish.

Identifying additional information fields

1. CPU index (optional).

Instructions

1. Run the top Linux command on the node that reports the alarm. The command gives a repetitive update of processor activity in real time. It gives a listing of the most CPU-intensive tasks of the system.

2. If the problem persists, contact your local Nokia Siemens Networks representative and provide the information gathered in the previous step.

Clearing

The alarm is cleared automatically by the operating system's fault detector once the CPU usage is at a low enough level. The raising and clearing thresholds are different to prevent unnecessary thrashing, that is, the alarm being raised and cleared repeatedly.
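
The effect of separate raising and clearing thresholds can be illustrated with a short Python sketch. The threshold values (90 and 80 percent) and the function name are hypothetical; the actual limits used by the fault detector are not given in this document.

```python
def alarm_states(samples, raise_at=90.0, clear_at=80.0):
    """Return the alarm state after each usage sample (True = alarm active)."""
    active = False
    states = []
    for usage in samples:
        if not active and usage >= raise_at:
            active = True   # usage crossed the raising threshold
        elif active and usage <= clear_at:
            active = False  # usage fell to the clearing threshold
        states.append(active)
    return states


# Usage hovers just below the raising threshold after the alarm is raised.
print(alarm_states([85, 91, 89, 88, 79, 85]))
```

With a single threshold at 90, the samples 91 and 89 would raise and immediately clear the alarm. With two thresholds, the alarm stays active until the usage has clearly dropped.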

Testing instructions

Do not test this alarm, because testing it will result in reduced quality of service.

1.15 70158 FILE SYSTEM USAGE OVER LIMIT

Probable cause: Threshold Crossed



Meaning

The available disk space on a partition is smaller than the minimum requirement. The partition can be filled up, for example, by crashing programs that produce large core files, or by large log files if log rotation does not function.

There is a risk that some data cannot be written to the disk.

Identifying additional information fields

Mountpoint

Instructions

1. Run the Linux command df -k <mountpoint> on the node that reports the alarm to get a report of the usage of the file system disk space in 1 kilobyte blocks. See the mountpoint in the Identifying additional information fields of the alarm. Alternatively, run the Linux command df -h <mountpoint> to see the information in a human-readable format.

2. Run the Linux command du -k or du -h on the node that reports the alarm to discover the directories that consume most of the space.

3. Check with du -h /var/tmp/. if /var/tmp is among the large directories. If it is, remove the unnecessary files.

4. Check with du -h /var/log/. if /var/log is among the large directories. If it is, move the old files outside the Network Element (NE) using the appropriate network management tools.

5. Check with du -h /var/crash/. if /var/crash is among the large directories. If it is, move the core files outside the NE using the appropriate network management tools.

6. If the alarm is not cleared, contact your local Nokia Siemens Networks representative.
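
The search for large directories in steps 2 to 5 can also be sketched in Python when a sorted overview is preferred. The function names are hypothetical. Unlike du, the sketch sums apparent file sizes rather than allocated disk blocks, so the figures can differ slightly from the du output.

```python
import os


def directory_sizes_kb(root):
    """Return {subdirectory path: total size of its files in kB}."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for filename in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, filename)).st_size
                except OSError:
                    continue  # file vanished or is not accessible
        sizes[entry.path] = total // 1024
    return sizes


def largest_directories(root, count=5):
    """Return the count largest subdirectories of root, largest first."""
    sizes = directory_sizes_kb(root)
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    # /var is used here because steps 3 to 5 look at its subdirectories.
    for path, size_kb in largest_directories("/var"):
        print(f"{size_kb:>12} kB  {path}")
```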

Clearing

The alarm is automatically cleared by the operating system's fault detector once the amount of available disk space increases above the specified limit. The raising and clearing thresholds are different to prevent unnecessary thrashing, that is, the alarm being raised and cleared repeatedly.

Testing instructions

Do not test this alarm, because testing it in a live system will reduce the quality of service.

1.16 70159 MANAGED OBJECT FAILED

Probable cause: Software program abnormally terminated



Meaning

The named managed object (MO) has failed. The managed object can be a software, hardware or logical entity. The type of the managed object identifies the following:

• Node: The physical computing node, its system software, or operating system has failed, or the node has been manually restarted.

• Recovery Unit (RU): A recovery unit contains one or more processes. A recovery unit failure is usually caused by a process failure.

• Process: The process has crashed, terminated abnormally or stopped responding.

• Recovery Group (RG): A recovery group consists of one or more recovery units. A recovery group failure alarm is raised for an active-standby configuration, when both redundant components (recovery units of the recovery group) have failed. This is always a serious situation as it indicates a double failure (for example, two nodes have failed at the same time).

The effect of the situation depends on the managed object type:

• Node: Any important services/functions that are provided with an active-standby or N+M recovery group may be taken over by other operational nodes. Services may be down if standby/spare nodes are also down.

• Recovery Unit (RU): If the recovery unit belongs to an active-standby or N+M recovery group, the service may be taken over by an operational standby/spare recovery unit.

• Process: The service or function that the process provides is not available. A process failure can cause a recovery unit level recovery action or the system may attempt to restart the failed process.

• Recovery Group (RG): The service provided by the recovery group is not available. Manual correction is required, as the automatic system repair actions have not solved the problem.

The system High Availability Services (HAS) will periodically attempt to solve the problem with corrective actions, such as switchovers or restarts. The alarm system also clears the obsolete alarms that may have been raised by this managed object or by its child managed objects.



1. Identifies the managed object type: "Node", "Recovery unit", "Process" or "Recovery group".

2. Gives an explanatory string for the fault type (if that information is available) or just the string "failure". For example:
"Process has stopped responding to heartbeats"
"Node connection heartbeat failure"
"Recovery group failure"

Instructions

1. Log into the cluster and check that the named managed object has been successfully restarted.

2. Verify also that the MO did not raise any new alarms that would explain the failure.

You can check the status of an MO with the HAS user interface tool fshascli. An operational MO has the value ENABLED in the operational state attribute and an empty procedural status attribute.

For example, the state of the process NodeDNS in the recovery unit FSNodeDNSServer of the node AS-5 can be seen as follows:

$ fshascli --status /AS-5/FSNodeDNSServer/NodeDNS
/AS-5/FSNodeDNSServer/NodeDNS:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
availability( )
unknown(FALSE)
role(ACTIVE)
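
The rule for an operational MO (ENABLED operational state and an empty procedural status) can be checked automatically from captured fshascli output. The following Python sketch parses the name(value,...) lines shown in the examples of this document. The function names are hypothetical, and the sketch assumes that the output always has this layout.

```python
import re

# One attribute per match: name(VALUE1,VALUE2) with optional blanks inside.
ATTRIBUTE = re.compile(r"(\w+)\(\s*([^)]*?)\s*\)")


def parse_state(output):
    """Return {attribute name: list of values} from fshascli state output."""
    state = {}
    for name, values in ATTRIBUTE.findall(output):
        state[name] = [v.strip() for v in values.split(",") if v.strip()]
    return state


def is_operational(state):
    """ENABLED operational state and an empty procedural status."""
    return state.get("operational") == ["ENABLED"] and state.get("procedural") == []


# Output copied from the example above.
SAMPLE_STATE = """/AS-5/FSNodeDNSServer/NodeDNS:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
availability( )
unknown(FALSE)
role(ACTIVE)
"""
print(is_operational(parse_state(SAMPLE_STATE)))
```

Trailing annotations such as "<== Unlocked" in the examples contain no parentheses, so they are ignored by the parser.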

If the MO is not operational, perform the following steps:

1. With a node MO, you can wait for a node restart. The system will raise another alarm (70011 NODE NOT RESPONDING) if the node does not come up within some time.

2. Check the system logs (/var/log/master-syslog on the active CLA node) for error(s) that have occurred by searching for the MO's name and/or by looking at events that occurred before this alarm was raised.

3. You can also use the HAS user interface tool to initiate an immediate restart attempt of the failed MO using the -r (--restart) command line option:

$ fshascli --restart /AS-5/FSNodeDNSServer

The restart operation is mostly useful after a problem has been corrected. Verify the result from the syslog and by checking the status of the MO.

4. An alarm for a recovery group implies a multiple error situation (for example, multiple node failures) or a persistent configuration or corruption problem. In this case, contact your local Nokia Siemens Networks representative.


Testing instructions

Scenario 1: Alarm for a node

1. Restart an operational unlocked node using fshascli. For example,

$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED) <== Unlocked
operational(ENABLED) <== Operational
usage(IDLE)
procedural()
availability()
unknown(FALSE)
alarm()
$ fshascli --restart --nowarning /AS-1
/AS-1 is restarted successfully

2. Wait for a few seconds for the node to turn DISABLED. The alarm is raised after this. For example,

$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED) <== Unlocked
operational(DISABLED) <== No longer operational
usage(IDLE)
procedural(TERMINATING)
availability()
unknown(FALSE)
alarm()
$ fshascli --state /AS-1
/AS-1
administrative(UNLOCKED) <== Unlocked
operational(DISABLED) <== No longer operational
usage(IDLE)
procedural(TERMINATING)
availability()
unknown(FALSE)
alarm(MAJOR,OUTSTANDING) <== Alarm has been raised

The alarm raising is also visible in the syslog as a message that begins as follows:

ALARM RAISE SP=70159 . . .

The alarm is automatically cancelled when the node has successfully restarted.

The alarm cancellation is also visible in the syslog as a message that begins as follows:

ALARM CANCEL SP=70159 . . .


Scenario 2: Alarm for a process

1. Terminate an operational and unlocked "modest" severity process. An operational process has ENABLED operational state and an empty procedural status. You can search for modest criticality processes with the fshascli command --view. For example,

$ fshascli --view --filter process "/*/*/*"
. . .
/TA-A/TestApplAServer/TestProcA:
Process /TA-A/TestApplAServer/TestProcA
command=(/opt/Nokia/SS_ABC/bin/testProcA)
status=(fullHA)
startMethod=(requested)
severity(modest)
. . .

$ fshascli --state /TA-A/TestApplAServer/TestProcA
/TA-A/TestApplAServer/TestProcA:
administrative(UNLOCKED) <== Unlocked
operational(ENABLED) <== Operational
usage(ACTIVE)
procedural() <== Empty PS = running
availability()
unknown(FALSE)
alarm()
role(ACTIVE)

$ ssh TA-A killall testProcA

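2. Verify that the alarm was raised and (very likely) also immediately cancelled. The HAS cancels the alarm immediately if the process repair cycle allowed an immediate restart.

The alarm raising is also visible in the syslog as a message that begins as follows:

ALARM RAISE SP=70159 . . .

Similarly, the alarm cancellation is also visible in the syslog as a message that begins as follows:

ALARM CANCEL SP=70159 . . .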


Scenario 3: Alarm for a recovery unit

1. Terminate an operational and unlocked "important" severity process. This causes a failure of the recovery unit. An operational process has ENABLED operational state and an empty procedural status. You can search for important criticality processes with the fshascli command --view. For example,

$ fshascli --view --filter process "/*/*/*"
. . .
/TA-A/TestApplBServer/TestProcB:
Process /TA-A/TestApplBServer/TestProcB
command=(/opt/Nokia/SS_ABC/bin/testProcB)
status=(fullHA)
startMethod=(requested)
severity(important)
. . .

$ fshascli --state /TA-A/TestApplBServer/TestProcB
/TA-A/TestApplBServer/TestProcB:
administrative(UNLOCKED) <== Unlocked
operational(ENABLED) <== Operational
usage(ACTIVE)
procedural() <== Empty PS = running
availability()
unknown(FALSE)
alarm()
role(ACTIVE)

$ ssh TA-A killall testProcB

2. Verify that the alarm was raised and (very likely) also immediately cancelled. The HAS cancels the alarm immediately if the recovery unit repair cycle allowed an immediate restart.

The alarm raising is also visible in the syslog as a message that begins as follows:

ALARM RAISE SP=70159 . . .

Similarly, the alarm cancellation is also visible in the syslog as a message that begins as follows:

ALARM CANCEL SP=70159 . . .


1.17 70160 MEMORY USAGE OVER LIMIT

Probable cause: Threshold crossed



Meaning

Memory consumption is too high because some processes are using too much memory.

There is a risk that the node is unable to fulfil the tasks allocated to it because the processes cannot reserve enough memory for their use. As a result, the processes cannot perform the tasks allocated to them.

Instructions

1. Run the top Linux command on the node that reports the alarm to view a snapshot of the current global memory. Press M to sort the processes in the node based on their memory resident size to check which processes consume the most memory.

2. If the problem persists, contact your local Nokia Siemens Networks representative and provide them with the information gathered in the previous step.

Clearing

The alarm is automatically cleared by the operating system's fault detector once the memory usage is at a low enough level. The raising and clearing thresholds are different to prevent unnecessary thrashing, that is, the alarm being raised and cleared repeatedly.

Testing instructions

Do not test this alarm, because testing it will result in reduced quality of service.

1.18 70161 OPERATING SYSTEM MONITORING FAILURE

Probable cause: System call unsuccessful



Meaning

The fault detector in the operating system has failed to capture the statistics of the usage of a given resource.

The state of the named device cannot be discovered, which may indicate that there are some fundamental problems with it.

Identifying additional information fields

1. Failed subsystem

2. Failed resource, where the values are
• CPU: Index of the processor
• FILESYSTEM: Name of the mountpoint
• ETHERNET: Name of the interface
• MEMORY:
• RAID: Name of the device
• FC (Fibre Channel):

Instructions

If the alarm is not cleared automatically, contact your Nokia Siemens Networks representative.

Clearing

Do not clear the alarm. The alarm is automatically cleared when the fault detector of the operating system is able to capture the statistics of the failed resource.

Testing instructions

This alarm is difficult to test, because the hardware problem cannot be simulated.

1.19 70162 RAID ARRAY HAS BEEN DEGRADED

Probable cause: Disk problem


Default severity: 3 Major

MeaningRedundancy of the RAID array is lost. A device belonging to the RAID array can be marked faulty by the system. The alarm may be caused by either errors in the fibre channel (FC) or small computer system interface (SCSI) bus or by a potentially broken disk media.

In the case of a subsequent disk failure, data will be lost.

Identifying additional information fields1. RAID array.

Additional information fields2. Faulty device (optional).

InstructionsIf the hardware is FlexiServer Blade Hardware, then follow these instructions:

1. Use the command cat /proc/mdstat to check the status of the RAID array found in the Identifying additional information field of the alarm on the node that reports the alarm.The [UU] field printed by the command describes whether both of the disks are in the RAID array or not. If this field contains [_U] or [U_], one of the disks is not in the RAID array.

2. The redundancy of the RAID array should be automatically restored by the system within an hour. If the problem persists and the alarm is not cleared within an hour, contact your local Nokia Siemens Networks representative.

3. If the problem persists, try changing the faulty disk according to the hardware main-tenance instructions. If that does not help, contact your local Nokia Siemens Networks representative.
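
The check in step 1 can be scripted. The following is a minimal sketch, assuming the usual /proc/mdstat layout in which the member status of each array appears in square brackets (for example [UU] or [_U]). The function name degraded_arrays is illustrative and is not part of the product.

```shell
#!/bin/sh
# Print the name and member-status field of every md array that is missing
# a member, i.e. whose status field such as [U_] or [_U] contains "_".
# $1: file holding mdstat-formatted text (defaults to /proc/mdstat).
degraded_arrays() {
    awk '
        /^md[0-9]+ *:/ { array = $1 }      # line that starts an array entry
        match($0, /\[[U_]+\]/) {           # line that carries the [UU]-style field
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0)
                print array, status
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments on the alarming node, the function prints one line per degraded array, for example md8 [_U], and prints nothing when every array is fully redundant.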

If the hardware is IBM BladeCenter, then follow these instructions:

1. Check the Maintenance Module and find the faulty disk and the possible cause of the fault. Replace the faulty disk with a new disk, referring to the hardware maintenance documentation for detailed replacement instructions.

2. The redundancy of the RAID array should be automatically restored by the system within an hour. If the problem persists and the alarm is not cleared within an hour, contact your local Nokia Siemens Networks representative.

Clearing

The alarm is automatically cleared by the operating system's fault detector once the redundancy of the RAID array is restored.

Testing instructions

Do not test this alarm in a live system. Any real disk faults during the execution of this test may lead to data corruption.


1.20 70163 ETHERNET INTERFACE USAGE OVER LIMIT

Probable cause: Threshold Crossed

Meaning

The load on the Ethernet interface is very high. This alarm may be raised, for example, when large files are copied over the network, causing a lot of network file system (NFS) traffic.

Packets are not being lost yet, but if the load on the interface keeps increasing, packets might eventually be lost.

Identifying additional information fields

1. Bonding interface

2. Ethernet interface

Instructions

This is an informative alarm and does not require direct actions.

Clearing

The alarm is automatically cleared by the operating system's fault detector once the Ethernet load has decreased to a tolerable level.

Testing instructions

Do not test this alarm, because testing it will create instability in the system.


1.21 70164 ETHERNET LINK FAILURE

Probable cause: Link failure

Meaning

The redundancy of Ethernet is lost because of an Ethernet link failure. The failure may have been caused by a hardware fault (for example, a broken Ethernet port), by an unplugged cable on the front panel of the gateway (GW) node, or by a program or user issuing a command that shuts down the Ethernet interface.

If another link fails, Ethernet packets are lost, which means that the node cannot receive or transmit data over the network.

Identifying additional information fields

1. Bonding interface

2. Ethernet interface

Instructions

1. If the alarm is raised for an external Ethernet interface, check that the cable is properly connected in the front panel of the GW node.

2. Take a console connection to the node with the alarming interface.

3. Check the status of the interface with the following command:

ifconfig -a <interface>

For example, ifconfig -a eth0

4. If the interface does not have the UP and RUNNING flags set, try to bring the interface up with the following command:

ifup <interface>

For example, ifup eth0

5. If the previous steps have not resolved the situation, contact your local Nokia Siemens Networks representative.
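
The flag check in steps 3 and 4 can be expressed as a small shell function. This is a sketch only: the function name iface_up_and_running is illustrative, and the -w option of grep, although widely available, is not required by POSIX.

```shell
#!/bin/sh
# Read ifconfig-style output on stdin and succeed (exit status 0) only when
# both the UP and the RUNNING flags are present as whole words.
iface_up_and_running() {
    output=$(cat)
    printf '%s\n' "$output" | grep -qw 'UP' &&
        printf '%s\n' "$output" | grep -qw 'RUNNING'
}
```

A possible use is: ifconfig -a eth0 | iface_up_and_running || echo "eth0 is not UP and RUNNING". Whether to run ifup afterwards remains the operator's decision, as described in step 4.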

Clearing

The alarm is automatically cleared by the operating system's fault detector when the Ethernet link comes up.

Testing instructions

Do not test this alarm, because testing it will create instability in the system.


1.22 70166 MANAGED OBJECT LOCKED

Probable cause: Software program abnormally terminated

Meaning

The administrative state of the named managed object (MO), which can be a cluster, a node, or a recovery unit (RU), has changed to LOCKED as a result of a user action (graceful shutdown or lock operation).

The named MO and its child MOs have been stopped and will not be started until the user performs a corresponding unlock operation. The service provided by the MO is not available, unless the MO is an RU with some operational and UNLOCKED redundant resources.

When an MO is locked, the alarm system of the cluster clears the alarms raised by the MO and its child MOs.

Additional information fields

Identifies the MO type: a cluster, a node, or an RU.

Instructions

This is an informative alarm and does not require any actions.

Clearing

Do not clear the alarm. This is an informative alarm and will be cleared automatically by the alarm system after its time to live has expired.

Testing instructions

Lock the managed object using fshascli. For example:

$ fshascli --lock --nowarning /AS-1/FSNodeDNSServer

The alarm is visible in the syslog as a message that begins as follows:

ALARM RAISE SP=70166...

Note that the test case for alarm 70189 MANAGED OBJECT UNLOCKED BY OPERATOR should be run after this to restore the initial situation.
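
To confirm that the alarm was raised, the syslog can be searched for the ALARM RAISE line. The sketch below assumes that alarm messages are written to /var/log/master-syslog, which is the syslog file named elsewhere in this document; pass a different file if your system logs alarms elsewhere. The function name alarm_raises is illustrative.

```shell
#!/bin/sh
# Print the log lines that record a raise of the given specific problem (SP).
# $1: alarm number, for example 70166
# $2: log file to search (assumed default: /var/log/master-syslog)
alarm_raises() {
    sp=$1
    log=${2:-/var/log/master-syslog}
    # The number must be followed by a non-digit or the end of the line,
    # so that searching for 70166 does not match 701660.
    grep -E "ALARM RAISE SP=${sp}([^0-9]|\$)" "$log"
}
```

For example, alarm_raises 70166 prints the matching lines and returns a non-zero exit status when there are none.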


1.23 70168 CLUSTER STARTED (RESTARTED)

Probable cause: Software environment problem

Meaning

The whole cluster is starting or restarting.

Starting or restarting the whole cluster means (re)starting all managed objects within the cluster.

The (re)start may have been initiated by an operator or caused by fatal errors in some critical hardware or software component. When the cluster is restarted, the alarm system clears all alarms that were raised by the cluster's managed objects before the restart.

Instructions

This is an informative alarm indicating that the whole cluster has been (re)started. As this operation is critical for software and hardware, carefully check the alarm status in the cluster after the restart.

Clearing

Clear the alarm after carefully checking the alarm status in the cluster.

Testing instructions

1. Restart the cluster using fshascli:

$ fshascli --restart --nowarning /

2. Wait for the cluster to restart. The alarm is visible in the alarm database (if configured) and in syslog as a message that begins as follows:

ALARM RAISE SP=70168 ...

3. Note that all services are unavailable during restart.


1.24 70173 BACKEND DATABASE REQUIRED BY CORBA NAMING SERVICE IS UNAVAILABLE

Probable cause: Underlying Resource Unavailable

Meaning

The MySQL database instance DB_CosNaming, used by the private CORBA naming service (NaS) instance, cannot be contacted by the NaS wrapper. Note that the recovery group that owns the backend database is NamingServiceDB, and the CORBA NaS instances belong to the recovery group PrivateCosNaming.

The CORBA NaS is not able to store data in the database. Therefore, the CORBA NaS is not functional and replies to the high availability services (HAS) heartbeats with a failure indication.

Instructions

• Check that the error situation still exists:

/opt/Nokia/SS_Naming/bin/ns_listall

This command should list the content of the private naming graphs when the NaS is working correctly. If the command throws exceptions, the NaS is not working correctly, which may result, for example, from an unavailable backend database.

• Check whether the backend database DB_CosNaming (RG NamingServiceDB) is unlocked and active:

fshascli -s /NamingServiceDB

If the NamingServiceDB is locked, unlock it:

fshascli -u /NamingServiceDB

After a few seconds, the database should have restarted and the NaS should have automatically re-established connections. Verify the restart and the re-established connections by issuing the ns_listall command mentioned above.

• If this does not solve the problem, there is something wrong with the database deployment or configuration. In that case, the alarm 70156 DISK DATABASE WATCHDOG START-UP FAILED should also have been raised by the MySQL DB watchdog dedicated to the DB_CosNaming database instance. The following steps describe the error checking procedure if the NamingServiceDB RG fails (see the alarm description 70156 DISK DATABASE WATCHDOG START-UP FAILED for more information).

1. Check the master-syslog for any indication of errors:

less /var/log/master-syslog

2. Check that the LDAP (Lightweight Directory Access Protocol) server is up and running.

• Check that the RG owning the LDAP server is unlocked:

fshascli -s /Directory


• Check that the LDAP server is really working by listing the content of the LDAP tree (CTRL-C aborts the listing):

ldapsearch

3. If the LDAP is working correctly, check that the DB directory mount is functional:

• Lock the NamingServiceDB RG (if not yet locked).
• Mount the database directory manually.

a) Create the SW RAID (md device) on which the DB_CosNaming directory is stored:

create_sw_raid /dev/md8 \
/dev/VG_62/MySQL_DB_CosNaming \
/dev/VG_63/MySQL_DB_CosNaming

Note that the device paths given as arguments above may be different in your system. Check the correct device paths from:

/opt/Nokia_BP/etc/ldapfile/ldif_in/PFSAN*.ldif

The device paths are defined under an entry defining the FSHWSWRAID object class for the NaS:

dn: fshwStorageResourceName=/dev/md8, fshwSANName=0,fsFragmentId=HW, fsClusterId=ClusterRoot
fshwStorageResourceName: /dev/md8
objectClass: FSHWStorageResource
objectClass: FSHWSWRAID
objectClass: extensibleObject
fshwRAIDLevel: 1
fshwPartitionName: /dev/VG_62/MySQL_DB_CosNaming
fshwPartitionName: /dev/VG_63/MySQL_DB_CosNaming
fsUserComment: MySQL DB for CORBA Naming Service

b) Mount the directory:

mkdir /tmp/tmp_nasDB
mount /dev/md8 /tmp/tmp_nasDB

Remember to unmount the directory and to stop the md device after the following checks have been performed (see the last step).

4. Check that the database disk content is accessible and readable:

ls -la /tmp/tmp_nasDB

5. Check that the my.cnf and odbc.ini files exist in that directory and have read access rights. Check also that these files are identical to those under the SS_Naming home directory:

diff /tmp/tmp_nasDB/odbc.ini /opt/Nokia/SS_Naming/etc/odbc.ini
diff /tmp/tmp_nasDB/my.cnf /opt/Nokia/SS_Naming/etc/my.cnf

6. Check the mysql.err file for any error indications. This file is also located in the /tmp/tmp_nasDB directory.

7. Remove the mount and stop the md device:

a) Unmount and remove the directory:

umount /tmp/tmp_nasDB
rmdir /tmp/tmp_nasDB


b) Stop the md device:

mdadm --manage -S /dev/md8

If any of the preceding checks fail, a major software failure exists in the system. In that case, contact your Nokia Siemens Networks representative with the information gathered during the preceding steps.
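
Looking up the device paths for step 3 a) can be scripted. The following is a minimal sketch, assuming the LDIF file holds one attribute per line and separates entries with blank lines, as standard LDIF does; folded (continued) lines are not handled. The function name raid_partitions is illustrative.

```shell
#!/bin/sh
# Print the fshwPartitionName values of the LDIF entry whose
# fshwStorageResourceName attribute equals the given md device.
# $1: md device, for example /dev/md8; remaining arguments: LDIF files.
raid_partitions() {
    md=$1
    shift
    awk -v md="$md" '
        BEGIN { RS = ""; FS = "\n" }       # one record per entry, one field per line
        {
            hit = 0
            for (i = 1; i <= NF; i++)
                if ($i == ("fshwStorageResourceName: " md)) hit = 1
            if (hit)
                for (i = 1; i <= NF; i++)
                    if ($i ~ /^fshwPartitionName: /)
                        print substr($i, length("fshwPartitionName: ") + 1)
        }
    ' "$@"
}
```

For example, raid_partitions /dev/md8 /opt/Nokia_BP/etc/ldapfile/ldif_in/PFSAN*.ldif prints the partition paths to pass to create_sw_raid, one per line.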

Clearing

HAS clears the alarm automatically when it has detected the NaS to be faulty and has therefore restarted the PrivateCosNaming recovery group.

However, if the backend database remains faulty, the alarm is raised again. This may result in a restart loop constantly raising the same alarm. Therefore, if the problem seems to be permanent, it is recommended to lock the NaS and the database recovery groups with the following commands:

fshascli -l /NamingServiceDB

fshascli -l /PrivateCosNaming

and to clear the alarm manually before performing the steps for solving the error.


Testing instructions

1. Unlock the NamingServiceDB RG.
2. Unlock the CosNaming and PublicCosNaming RGs.
3. Run the command /opt/Nokia/SS_Naming/bin/ns_listall. It should list all the objects bound in the name service, which shows that the Naming Service is functional.
4. Lock the NamingServiceDB RG.

Within some tens of seconds the alarm should be raised.

Clearing:

1. Lock the CosNaming and PublicCosNaming RGs.
2. Unlock the NamingServiceDB RG.
3. Unlock the CosNaming and PublicCosNaming RGs.
4. Check with /opt/Nokia/SS_Naming/bin/ns_listall that the naming service is functional again.

The alarm should be cleared at this point.

The alarm is automatically cleared by the naming service when it re-establishes connections to the database.


1.25 70186 CLUSTER OPERATION INITIATED BY OPERATOR

Probable cause: Congestion

Meaning

This is an informative alarm which indicates that an operator has initiated a cluster operation on the specified managed object (MO). The MO can refer to the whole cluster, a node, a recovery unit (RU), a recovery group (RG), or a process. The platform high availability services (HAS) is now executing the operation. The operation can be

• switchover
• restart
• power-off.

The operations have different effects:

• Switchover
Applicable only to recovery groups (RG). The active RU instance of the RG is terminated and a standby instance on another node is started or, in the case of a hot active standby RG, activated. The service provided by the named RU is down until the switchover is complete.

• Restart
For the cluster and nodes, this means a physical restart (reboot) of the node(s). For other MOs, the named MO is stopped and restarted. The services provided by the named MO are down during the restart.

• Power-off
Applicable only to nodes. The named node is being powered off.

Additional information fields

1. Identifies the MO type (the cluster, a node, a process, or an RU).

Clearing

The alarm system clears the alarm automatically after its time to live has expired.

Testing instructions

1. Log into the cluster.
2. Restart a managed object using fshascli. For example:

fshascli --restart --nowarning /AS-1

The alarm is visible in the alarm database (if configured) and in the syslog as a message that begins as follows:



1.26 70188 MANAGED OBJECT SHUTDOWN BY OPERATOR

Probable cause: Congestion

Meaning

This is an informative alarm which indicates that the specified managed object (MO), which can be the whole cluster, a node, or a recovery unit (RU), is being shut down. The named MO and all its unlocked sub-resources are now terminating.

The MO is being shut down by an operator. All services provided by the named MO are terminating. Once the operation is completed, the administrative state of the MO and all its sub-MOs will be changed to locked.

Note that a shutdown request may take a long time if the maximum duration for the operation has not been specified. The shutdown request can be forced to completion by issuing a lock command. In that case, the platform high availability services (HAS) will terminate the services ungracefully.

Additional information fields

1. Identifies the MO type (a cluster, a node, or an RU)

Instructions

This is an informative alarm which requires no user actions.

Clearing

The alarm system clears this alarm automatically after its time to live has expired.

Testing instructions

The target of the shutdown command can be a cluster, node, recovery group or recovery unit.

1. Log into the cluster.
2. Execute the shutdown command on the managed object. For example:

fshascli --shutdown /AS-1

The alarm is also visible in the syslog as a message that begins as follows:

ALARM RAISE SP=70188 ...

Note that in the example above --shutdown does not power off the node. It just gracefully shuts down all HAS managed non-critical processes in the node.

After the testing is finished, use the fshascli --unlock command to restore the initial situation. For example:

fshascli --unlock /AS-1


1.27 70189 MANAGED OBJECT UNLOCKED BY OPERATOR

Probable cause: Congestion

Meaning

This is an informative alarm which indicates that the specified managed object (MO), which can be the whole cluster, a node, or a recovery unit (RU), has been unlocked. The named MO and its unlocked sub-resources (if there are any) can now be activated.

Notice that the MO (or its sub-MOs) can remain locked because of a dependency on higher-level MOs. That is, the unlock operation has no effect on the MO in question until the higher-level MOs are unlocked. For example, an RU in a node remains locked if the node or the cluster MO is locked.

The MO has been set to the unlocked state. If all the higher-level MOs are unlocked as well, the services provided by the MO are activated.
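
The dependency on higher-level MOs can be illustrated with a short sketch. It assumes that the MO paths used with fshascli reflect the hierarchy, with / as the cluster, /AS-1 as a node, and /AS-1/FSNodeDNSServer as an RU in that node, as the examples in this document suggest. The function name effectively_locked and the list of locked MOs fed to it are hypothetical; the document does not describe how to obtain such a list.

```shell
#!/bin/sh
# Succeed when the MO path in $1, or any MO above it, is listed as locked.
# stdin: one locked MO path per line, e.g. "/" (cluster) or "/AS-1" (node).
effectively_locked() {
    mo=$1
    case $mo in /*) ;; *) return 2 ;; esac   # only absolute MO paths are handled
    locked=$(cat)
    while :; do
        if printf '%s\n' "$locked" | grep -Fxq -- "$mo"; then
            return 0
        fi
        if [ "$mo" = "/" ]; then
            return 1                         # reached the cluster root, nothing locked
        fi
        parent=${mo%/*}                      # strip the last path component
        mo=${parent:-/}                      # the parent of "/AS-1" is the cluster "/"
    done
}
```

With /AS-1 in the locked list, effectively_locked /AS-1/FSNodeDNSServer succeeds: the RU stays out of service until the node is unlocked, even though the RU itself has been unlocked.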


Additional information fields

Identifies the MO type (a cluster, a node, or an RU)

Testing instructions

Unlock the previously locked managed object using fshascli:

1. Log into the cluster.
2. Unlock the managed object using fshascli. For example:

fshascli --unlock /AS-1/FSNodeDNSServer

The alarm is also visible in the syslog as a message that begins as follows:


Note that this test should be run after the test case for alarm 70166 MANAGED OBJECT LOCKED.


1.28 70236 LDAP DATABASE CORRUPTED

Severity Major

Fault reason

A primary or secondary Lightweight Directory Access Protocol (LDAP) database is corrupted and cannot be accessed anymore. An LDAP database can get corrupted, for example, when:

• a disk becomes full while the database is being updated
• a node failure and/or ungraceful node restart happens while the database is being updated.

The identified LDAP database is currently unavailable.

In the case of a secondary database, the only impact is that node start-ups can take slightly longer because some platform services attempt to use the secondary database(s) by default.

Failure of the primary database has a more significant impact. Most application processes cannot be (re)started anymore and applications that update LDAP will fail. If a secondary database is still available, nodes can still be (re)started but only basic platform services will be able to start. If the primary and all secondary databases have failed, neither the cluster nor any of its nodes can be (re)started anymore. The system will next automatically try to recover the corrupted database from an operational primary or secondary database.

Description



1. Type of the database: Primary or Secondary
2. Relative path of the database directory. Notice that secondary databases are usually located in a directory such as /var/mnt/local/localimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatformSlave-ldbm. The primary LDAP database directory is usually of the following format: /var/mnt/local/sysimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatform-ldbm. Notice especially that the lowest level directory is fsPlatformSlave-ldbm for secondary databases and fsPlatform-ldbm for the primary database.

Instructions

The system will automatically attempt to recover the corrupted database from a functional copy. If the automatic recovery is successful, this alarm is automatically cleared and the system raises a new "CORRUPTED LDAP DATABASE RECOVERED" warning alarm. The automatic recovery, if successful, takes less than a minute.

If the primary and secondary database(s) are all corrupted, you must restore them from a backup. DO NOT ATTEMPT TO RESTART THE CLUSTER OR ANY OF ITS NODES BEFORE ENSURING THAT THE PRIMARY DATABASE IS OPERATIONAL. The applications may still be providing service normally, and a service interruption only happens if an unsuccessful restart attempt is made.

Notice, however, that the automatic recovery will fail if the node or database disk has become full. In this case, you can attempt to solve the situation by making space to the disk, and then allowing the system to retry automatic recovery. To do this, perform the following steps:

1. Log into the node that has the corrupted database as the root user. For example, log into the node (usually CLA-0 or CLA-1) where the directory service is active:

ssh root@mycluster-directory
<password>

2. Check the available disk space with the df command. For example,

df -k

root@CLA-1(mycluster):~# df -k
Filesystem                       1k-blocks     Used Available Use% Mounted on
/dev/rd/0                            15863    10698      4346  72% /
tmpfs                              1029260        8   1029252   1% /tmp
/dev/md/0                          4999712  1401348   3598364  29% /var/mnt/local/localimg
directory:/var/mnt/local/sysimg
                                  49998408 49998408         0 100% /var/mnt/remote/sysimg_rw
directory:/var/mnt/local/sysimg
                                  49998408 49998408         0 100% /var/mnt/remote/sysimg_ro
/dev/md/1                         49998404 49998408         0 100% /var/mnt/local/sysimg
/dev/md/9                         19999256    32840  19966416   1% /var/mnt/local/backup

3. If the database partition (in this example the system image partition) is full, release space, for example, by deleting excess core and syslog files. You can locate large files on the partition using the find command: use the cd command to go to the partition mount point directory and search for files below it. For example,

cd /var/mnt/local/sysimg
find . -type f -name "syslog*" -size +1000000

(With find, -size +1000000 matches files larger than 1000000 blocks of 512 bytes, that is, about 512 MB.)

You can also locate core files using the find command. For example,

cd /var/mnt/local/sysimg
find . -type f -name "*core"

When the disk has at least 100 MB of free space, have the system retry the recovery:

• In the case of a secondary database, reboot the node. For example, execute the following command:

shutdown -r now

• In the case of the primary database, use fshascli to restart the Directory service:

fshascli -rnF /Directory

Note that this will terminate your terminal connection, so you will need to log in again.

If the database was not corrupted because of a full disk, or the automatic recovery fails again, for example, because all LDAP databases are corrupted, you must restore the databases from a backup copy. For instructions on the restore process, see the backup and restore customer documentation.
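
The 100 MB threshold mentioned above can be checked with a short filter over df output. This is a sketch, not a product tool: the function name low_space_mounts is illustrative, and it expects the POSIX output format produced by df -kP, which keeps each file system on a single line (unlike the wrapped listing shown in step 2). Mount points containing spaces are not handled.

```shell
#!/bin/sh
# Read `df -kP` output on stdin and print each mount point whose available
# space is below the limit given in $1 (in KB; default 102400 KB = 100 MB).
low_space_mounts() {
    awk -v min="${1:-102400}" 'NR > 1 && ($4 + 0) < (min + 0) { print $6 }'
}
```

For example, df -kP | low_space_mounts lists the mount points that still need space released before the recovery is retried.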

Clearing

The alarm is cleared automatically if the automatic recovery operation is successful. If the database has to be restored manually from a backup, the alarm must be cleared manually.

Testing instructions

The alarm can be tested by simulating a secondary LDAP corruption. This can be done by renaming the secondary LDAP database directory in the CLA node where the Directory recovery group is active.

Move to the directory where the secondary LDAP is located. The default location is /var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active/. The location and the name of the database can also be verified from the fsPlatformSlave.conf file located under /opt/Nokia_BP/etc/ldapfiles. The secondary LDAP database is defined after the "directory" tag.

cd /var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active

Rename the current secondary LDAP database.

mv fsPlatformSlave-ldbm fsPlatformSlave-ldbm.bkp

Execute the LDAP recovery script manually. The execution of the script may take several minutes.

/opt/Nokia_BP/bin/fsLDAPRecoverDatabase -s

The alarm should be visible immediately after starting the recovery script. The script will use the primary LDAP to restore the secondary LDAP database after which the alarm will be cancelled. Also alarm "70237: CORRUPTED LDAP DATABASE RECOVERED" will be raised.

If the alarm was cancelled successfully and a new secondary LDAP database was created the backup database can safely be removed.

rm -rf fsPlatformSlave-ldbm.bkp

If the alarm was not cancelled, the secondary LDAP database was not created or the script was terminated before it could finish, restore the backup database. In this case the alarm needs to be cancelled manually. Remove the partially created secondary LDAP database if one exists.

rm -rf fsPlatformSlave-ldbm

Restore the original database.

cp -r fsPlatformSlave-ldbm.bkp fsPlatformSlave-ldbm

Cancelling


1.29 70237 CORRUPTED LDAP DATABASE RECOVERED

Probable cause: Corrupt data

Meaning

A primary or secondary LDAP (Lightweight Directory Access Protocol) database was corrupted but it has been successfully recovered. The LDAP databases can get corrupted, for example, when

• a disk becomes full while the database is being updated
• a node failure and/or ungraceful node restart happens while the database is being updated.

The platform software has automatically recovered the database from an operational primary or secondary database. Some applications may have been impacted by the temporary unavailability of the LDAP database. As the platform restarts the failed applications, the problem should not have caused permanent problems.

1. Type of the database that was corrupted: "Primary" or "Secondary".
2. Relative path of the database directory. Notice that secondary databases are usually located in a directory such as /var/mnt/local/localimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatformSlave-ldbm. The primary LDAP database directory is usually in the following format: /var/mnt/local/sysimg/<platform release>/opt/Nokia_BP/var/pmgmt/<platform release>/fsPlatform-ldbm. Notice especially that the lowest level directory is fsPlatformSlave-ldbm for secondary databases and fsPlatform-ldbm for the primary database.

Instructions

This is an informative alarm. No operator actions are required.

Testing instructions

The alarm can be tested by simulating a secondary LDAP corruption. This can be done by renaming the secondary LDAP database directory in the CLA node where the Directory recovery group is active.

Change the directory to the one where the secondary LDAP is located. The default location is /var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active/. The location and the name of the database can also be verified from the fsPlatformSlave.conf file located under /opt/Nokia_BP/etc/ldapfiles. The secondary LDAP database is defined after the "directory" tag.


cd /var/mnt/local/localimg/flexiserver/opt/Nokia_BP/var/pmgmt/_active

Rename the current secondary LDAP database.

mv fsPlatformSlave-ldbm fsPlatformSlave-ldbm.bkp

Execute the LDAP recovery script manually. The execution of the script may take several minutes.

/opt/Nokia_BP/bin/fsLDAPRecoverDatabase -s

Alarm "70236: LDAP DATABASE CORRUPTED" should be visible immediately after starting the recovery script. The script will use the primary LDAP to restore the secondary LDAP database, after which this alarm (70237) will be raised. Also alarm "70236: LDAP DATABASE CORRUPTED" will be cancelled.

If the alarm was raised successfully and a new secondary LDAP database was created, the backup database can safely be removed.

rm -rf fsPlatformSlave-ldbm.bkp

If the alarm was not raised, the secondary LDAP database was not created or the script was terminated before it could finish, restore the backup database. Remove the partially created secondary LDAP database if one exists.

rm -rf fsPlatformSlave-ldbm

Restore the original database.

cp -r fsPlatformSlave-ldbm.bkp fsPlatformSlave-ldbm


1.30 70243 ALARM PROCESSOR CONFIGURATION IS OUT OF ORDER

Probable cause: Configuration or customising error

Default severity: 4 Minor

Meaning

The configuration of the alarm processor contains an invalid attribute value, or an attribute is missing.

The system ignores the invalid value and uses a default value.

1. Invalid attribute's value, or an empty string if the attribute or its value is missing.

Instructions

1. Use the parameter management application to correct the invalid value of the attribute. The distinguished name of the attribute, which identifies its location in the LDAP, can be found in the 'Managed Object Id' field of the alarm.

2. Restart the alarm processor with the following command:

fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor

where <Node> is the name of the node where the alarm processor is deployed.

The default values of the alarm processor attributes used when correcting the situation are listed below:

Attribute                                    Default value
fsNumProcessors                              5
fsHasSimpleAware                             true
fsHasSimpleBindAddr                          localhost
fsHasSimplePort                              49703
fsHasSimpleBackLog                           3
fsLogFileName                                /var/log/master-alarms
fsLogParserSleepTime                         10
fsAlarmNotificationCollectorSleepTime        12
fsParameterNotificationProcessorSleepTime    15
fsAlarmHistoryProcessorSleepTime             300
fsAlarmHistorySize                           604800000
fsBatchSize                                  20
fsHeartbeatInterval                          300
fsAlarm70247raise                            true
fsSeverityChangeReRaise                      False
fsNotificationBatchSize                      20
fsStrictAlarmTimeOrder                       false

Testing instructions

1. Use the parameter management application to set an invalid value for an attribute, for example, a negative value for fsAlarmNotificationCollectorSleepTime (the DN for it in LDAP is fsParameterId=fsAlarmNotificationCollectorSleepTime, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot).

2. Restart the alarm processor with the following command:

fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor

where <Node> is the name of the node where the alarm processor is deployed.

3. After verifying that an alarm for the situation has been raised, correct the fault as described in the 'Instructions for operator' field and check that the alarm is cleared.


1.31 70244 CORRUPTED ALARM DATA

Probable cause: Corrupt data

Meaning

Corrupted data was found in the alarm log file.

The corrupted record in the alarm log file is ignored, meaning that an alarm notification may have been lost or a more serious system error may have occurred.

Identifying additional information fields

1. Invalid record (note that the field can hold no more than about 390 symbols, so the original invalid record may be truncated).

Additional information fields

2. Error code, possible values:

   1. missing mandatory field
   2. duplicated field
   3. empty record
   4. non-alarm data record.

3. Field name (for a missing or duplicated field).

Instructions

1. Fill in a problem report with the alarm data and send it to your local Nokia Siemens Networks representative.

Clearing

Clear the alarm with the alarm management application after correcting the fault as presented in Instructions.

Testing instructions

1. Create a text file containing an empty row or a row with some dummy information.
2. Use the parameter management application to record the current value of the fsParameterId=fsLogFileName, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute in the alarm processor LDAP configuration, and replace it with the name of the created file.
3. Restart the alarm processor with the following command:

fshascli -r /<node>/FSAlarmSystemServer/AlarmProcessor

where <node> is the name of the node where the alarm processor is deployed.
4. After verifying that an alarm for the situation has been raised, clear it with the alarm management application.
5. Use the parameter management application to restore the original name of the alarm log file.
6. Restart the alarm processor.


1.32 70245 ILLEGAL INTERNAL USAGE OF EXTERNAL ALARM NOTIFICATION FORMAT

Probable cause: Software Program Error

Event type: x2

Meaning

An application raised or cleared an alarm containing an internal MOID (Managed Object ID) and provided its own alarm time. An application is allowed to provide an alarm time only for external alarms (alarms with external MOIDs). This alarm is also raised if an application raised or cleared an alarm containing an external MOID but did not provide its own alarm time.

The original alarm is discarded.
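
The rule above can be illustrated with a small classifier for alarm log records. It is a sketch under two assumptions inferred from the sample records used for testing this alarm: an external MOID is recognised by fsFragmentId=external in the MO= field, and an application-provided alarm time appears as a TIME= field. The function name check_alarm_time_usage is illustrative and is not how the alarm processor is implemented.

```shell
#!/bin/sh
# Read one alarm log record on stdin and print "illegal" when it breaks the
# rule above, otherwise "ok".
# Assumptions (inferred from the sample test records in this section):
#   - an external MOID contains fsFragmentId=external in the MO= field
#   - an application-provided alarm time appears as a TIME= field
check_alarm_time_usage() {
    record=$(cat)
    mo=${record#*MO=}          # drop everything up to and including "MO="
    mo=${mo%% AP=*}            # keep only the MO value, up to the AP= field
    case $mo in
        *fsFragmentId=external*) external=1 ;;
        *) external=0 ;;
    esac
    case $record in
        *" TIME="*) has_time=1 ;;
        *) has_time=0 ;;
    esac
    # Legal combinations: external MOID with a time, internal MOID without one.
    if [ "$external" -eq "$has_time" ]; then
        echo ok
    else
        echo illegal
    fi
}
```

Both sample test records for this alarm fall into the illegal class: the first has an internal MOID with a TIME= field, and the second has an external MOID without one.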

Identifying additional information fields

Data from the original alarm:

1. Managed Object ID
2. Specific problem
3. Identifying application additional information

(The application ID is present in the MOID field of the alarm)

Instructions

Fill in a problem report with the alarm data and send it to your Nokia Siemens Networks representative.



Testing instructions

1. Create a text file containing the following single row (wrapped here at the backslashes):

2008 Oct 15 18:31:39 ALARM RAISE SP=70156 \
MO=fshaProcessInstanceName=XWDforAlarmType,\
fshaRecoveryUnitName=FSAlarmDBServer,fsipHostName=WAS,\
fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot \
AP=fshaProcessInstanceName=XWDforAlarmType,\
fshaRecoveryUnitName=FSAlarmDBServer,fsipHostName=WAS,\
fsFragmentId=Nodes,fsFragmentId=HA,fsClusterId=ClusterRoot \
SE=5 NINFO="1" TIME=E1224084699996

2. Use the parameter management application to record the current value of the fsParameterId=fsLogFileName, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute in the alarm processor LDAP configuration, and replace it with the name of the created file.

3. Restart the alarm processor with the following command:

fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor

where <Node> is the name of the node where the alarm processor is deployed.

4. After verifying that an alarm for the situation has been raised (in the case of an internal MOID with provided time), clear it with the alarm management application.

5. Use the parameter management application to restore the original name of the alarm log file.

6. Restart the alarm processor.

7. Create a text file containing the following single row:

2008 Oct 15 18:32:39 ALARM RAISE SP=70159 \
MO=rncMOId=DN:NE-WBTS-34/WCEL-1,fsLogicalNetworkElemId=OMS,\
fsFragmentId=external,fsClusterId=ClusterRoot AP=fshaProcessInstanceName=HASNodeAgent,\
fshaRecoveryUnitName=FSNodeHAServer,\
fsipHostName=CLA-0,fsFragmentId=Nodes,fsFragmentId=HA,\
fsClusterId=ClusterRoot SE=3 NINFO="MO failed"

8. Repeat steps 2 and 3.

9. After verifying that an alarm for the situation has been raised (in the case of an external MOID without provided time), clear it with the alarm management application.

10. Repeat steps 5 and 6.


1.33 70246 ALARM SYSTEM HEARTBEAT

Probable cause: Timeout expired



Meaning

This is an informative alarm, which indicates that the alarm system itself is in operational state. The alarm system is continuously (after each expiration of a heartbeat interval) raising or clearing this alarm, which means that the state of this alarm is constantly changing in a loop (new alarm > cleared alarm > new alarm > cleared alarm > new alarm > ...) and the alarm time is updated by the time of the last raise or clear operation. If the refreshing of the alarm does not occur, it signals that the alarm system is faulty.

Note that there is a delay before the raise/clear operation becomes visible in the alarm monitoring tool. If the system is under heavy load it might take even longer for the operation to be visible in the alarm monitoring tool.



1. Heartbeat interval in seconds.

Instructions

1. If the alarm monitoring tool in use does not support an automatic alert in situations where the alarm system heartbeating is not functioning, check occasionally that the heartbeating functions properly. Use the time of the alarm and the value of the heartbeat interval (specified in the 'Application Additional Info' field) in the analysis of the situation.

2. Perform such checking also when the system does not generate any alarm events for a long time.

3. If the checking shows that the alarm time is not continuously refreshed, restart the alarm processor with the following command:

fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor

where <Node> is the name of the node where the alarm processor is deployed.

4. If restarting the alarm processor does not help, also restart the alarm system database with the following command:

fshascli -r /AlarmDB
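The manual check in steps 1 to 3 compares the alarm time with the heartbeat interval. A minimal Python sketch of that comparison is given below. The function name and the grace_s allowance (added to cover the display delay mentioned in the Meaning section) are assumptions for illustration and are not part of the product.

```python
from datetime import datetime, timedelta


def heartbeat_is_stale(last_alarm_time: datetime,
                       heartbeat_interval_s: int,
                       now: datetime,
                       grace_s: int = 60) -> bool:
    """Return True if the heartbeat alarm time has not been refreshed in time.

    last_alarm_time      -- time of the last raise or clear of alarm 70246
    heartbeat_interval_s -- value from the 'Application Additional Info' field
    grace_s              -- hypothetical allowance for monitoring tool delay
    """
    if heartbeat_interval_s <= 0:
        raise ValueError("heartbeat interval must be positive")
    allowed = timedelta(seconds=heartbeat_interval_s + grace_s)
    return now - last_alarm_time > allowed
```

If the function returns True for the most recent alarm time, the alarm time is not being refreshed and the restart in step 3 applies.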

Clearing

The alarm system clears the alarm when the heartbeat interval expires.


Testing instructions

1. Check with the parameter management application that the alarm system heartbeating is switched on, that is, that the fsParameterId=fsHeartbeatInterval, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute in the alarm system LDAP configuration has a positive value (set a positive value if needed).

2. Restart the alarm processor with the following command:

fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor

where <Node> is the name of the node where the alarm processor is deployed.

3. With the alarm system heartbeating switched on, check that only one instance of this alarm is raised or cleared within a period that is approximately equal to the heartbeat interval.


1.34 70247 ALARM SYSTEM HEARTBEATING SWITCHED OFF

Probable cause: Configuration or Customising Error



Meaning

The alarm system heartbeating is switched off, which means that the alarm system does not raise or clear its heartbeat alarms.

The alarm system heartbeating is the simplest and most efficient way for the operator to monitor that the alarm system itself is healthy. If heartbeating is switched off, the operator cannot detect if the alarm system becomes faulty. This is why it is strongly recommended that you keep the alarm system heartbeating always switched on. Nevertheless, the alarm system heartbeating can be switched off if an alternative heartbeating exists. Setting the value of the fsAlarm70247raise configuration parameter to false in the alarm system configuration disables raising of the 70247 alarm.



1. Heartbeat interval in seconds.

Instructions

1. Use the parameter management application to set a non-zero heartbeat interval in seconds (0 means that heartbeating is switched off) for the fsParameterId=fsHeartbeatInterval, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute in the alarm system LDAP configuration.

2. If the alarm system heartbeating is intentionally kept switched off, use the parameter management application to set the value of the fsParameterId=fsAlarm70247raise, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute to false in the alarm system LDAP configuration.

3. Restart the alarm processor with the following command:

fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor

where <Node> is the name of the node where the alarm processor is deployed.
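The interplay of the two parameters can be summarised with a short Python sketch. It only restates the rules given in this section (an interval of 0 switches heartbeating off, and fsAlarm70247raise set to false suppresses this alarm). Treating negative intervals like 0 is an assumption, and the function names are hypothetical.

```python
def heartbeating_enabled(fs_heartbeat_interval: int) -> bool:
    # Per this section, 0 means heartbeating is switched off.
    # Assumption: negative values are treated the same way as 0.
    return fs_heartbeat_interval > 0


def alarm_70247_expected(fs_heartbeat_interval: int,
                         fs_alarm_70247_raise: bool) -> bool:
    """Alarm 70247 is expected only when heartbeating is off and raising is allowed."""
    return (not heartbeating_enabled(fs_heartbeat_interval)
            and fs_alarm_70247_raise)
```

For example, an interval of 0 with fsAlarm70247raise set to true corresponds to the test setup of this alarm, while any positive interval corresponds to the cleared state after the alarm processor restart.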

Clearing

The alarm system clears the alarm automatically after restart if the alarm system heartbeating is switched on in the configuration.


Testing instructions

1. Use the parameter management application to set the value of the fsParameterId=fsHeartbeatInterval, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute to zero in the alarm system LDAP configuration. The value of the fsParameterId=fsAlarm70247raise, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot attribute should be true.

2. Restart the alarm processor with the following command:

fshascli -r /<Node>/FSAlarmSystemServer/AlarmProcessor

where <Node> is the name of the node where the alarm processor is deployed.

3. After verifying that an alarm for the situation has been raised, correct the fault as described in the 'Instructions for operator' field and check that the alarm is cleared.


1.35 70256 RESOURCE ALLOCATION OR DE-ALLOCATION FAILURE

Probable cause: Software Program Abnormally Terminated



Meaning

Allocation or deallocation of resources to or from a computer node in the cluster has failed.

Applications running in the cluster are often identified with resources that are allocated to the node before the application is started and released from the node after the application has terminated. Such resources can, for example, be TCP/IP addresses that are associated with the service provided by the software or a disk partition that for example contains the application database. In addition, the application can allocate and deallocate other resources (for example, start and stop 3rd party applications) in its control scripts.

An operation failure has been reported for the defined recovery unit while it was starting or stopping.

If the error occurred when an application was starting, application start-up is aborted. In case of a permanent fault, the service provided by the application is now down. With a transient or node-specific fault, and providing that the application has a standby, the application may have been restarted successfully on another node.

If the fault happened while the application was terminating, the node on which the error happened has now been restarted to restore it to a known state. If the node has restarted successfully or the application has a standby resource, the application has likely already restarted, and service is again available.



1. Name of the recovery group to which the recovery unit belongs. For example, "/Directory".

2. Situation when the failure happened: string "allocating" or "de-allocating"

3. Type of the resource allocation: "IP(address)", "disk(mount point)" or "ctrlscript". For example, "IP(192.1.1.78)" or "disk(sysimg)".

4. Only present if argument 3 is "ctrlscript". Contains the name of the control script that reported the failure. For example, "RUControlDirectoryServer.sh"

Instructions

1. Log into the network element as root user to check the situation.

2. Use the fshascli command to check the state of all recovery units within the recovery group (name of the recovery group is in the Application Additional Information field). If the recovery group is providing service, each of its UNLOCKED recovery units that has the ACTIVE role has the ENABLED operational state and an empty procedural status. For example, the state of recovery units of the /Directory recovery group can be checked as follows:

$ fshascli --state $(fshascli -children /Directory | grep -vE "\/.+\/.+\/" )
/CLA-0/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(IDLE)
procedural(NOTINITIALIZED)
availability()
unknown(FALSE)
alarm()
role(COLDSTANDBY)

/CLA-1/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
availability()
unknown(FALSE)
alarm()
role(ACTIVE)

In the above case, the recovery unit of the CLA-0 node is acting as a cold standby backup and the recovery unit on CLA-1 is running the service normally. Note that the grep command in the example is used to filter out information regarding individual processes in each recovery unit.

Since this is a situation that may be caused by various different faults, contact your Nokia Siemens Networks representative to analyse the root cause.
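The check described in step 2 can be automated on captured command output. The following Python sketch parses output in the name(value) layout shown in the example and applies the stated rule. It is an illustration only: the function names are hypothetical, and requiring at least one UNLOCKED recovery unit with the ACTIVE role is an added assumption (the text only states the condition in one direction).

```python
import re

SAMPLE = """\
/CLA-0/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(IDLE)
procedural(NOTINITIALIZED)
role(COLDSTANDBY)

/CLA-1/FSDirectoryServer:
administrative(UNLOCKED)
operational(ENABLED)
usage(ACTIVE)
procedural()
role(ACTIVE)
"""


def parse_states(text: str) -> dict:
    """Map each recovery unit path to its attribute dictionary."""
    units = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("/"):
            # A new recovery unit block starts with its path.
            current = line.rstrip(":")
            units[current] = {}
        elif current is not None:
            # Attributes are printed as name(value); the value may be empty.
            for key, value in re.findall(r"(\w+)\(([^)]*)\)", line):
                units[current][key] = value
    return units


def group_provides_service(units: dict) -> bool:
    active = [s for s in units.values()
              if s.get("administrative") == "UNLOCKED"
              and s.get("role") == "ACTIVE"]
    # Assumption: at least one active unit must exist for service to be provided.
    return bool(active) and all(
        s.get("operational") == "ENABLED" and s.get("procedural") == ""
        for s in active)


if __name__ == "__main__":
    print(group_provides_service(parse_states(SAMPLE)))
```

The cold standby unit on CLA-0 is ignored by the check because it does not have the ACTIVE role, which matches the interpretation given for the example output.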

Clearing

Clear the alarm manually after the problem has been solved.

Testing instructions

Simulate an IP address allocation failure

1. An IP address allocation failure can be caused by manually allocating an IP address to a node before a recovery unit is started. Select a cold active/standby recovery group (but do not use the Directory recovery group) that has an IP address associated with it, and allocate the address to the standby node. For example:

$ fshascli --state /CLA-0/FSClusterDNSServer
/CLA-0/FSClusterDNSServer
administrative(UNLOCKED) <== Unlocked
operational(ENABLED) <== Operational
usage(IDLE)
procedural(NOTINITIALIZED)
availability()
unknown(FALSE)
alarm()
role(COLDSTANDBY)
$ grep ClusterDNS /etc/hosts
192.168.2.255 ClusterDNS
. . .
$ ip addr show | grep 192.168.2.255
inet 192.168.2.255/23 scope global secondary bond0
inet fe80::192:168:2:255/10 scope link
$ ssh cla-0
Last login: . . .
$ ip address add 192.168.2.255/23 dev bond0

2. Issue a switchover for the recovery group so that the service attempts to move to the node that already has the IP address. For example:

$ fshascli --switchover /ClusterDNS

The switchover fails and the alarm gets raised. The alarm is visible, for example, in the alarm log. Note that you have to cancel the alarm manually.

3. Remove the IP address that you added manually or reboot the node. For example:

$ ip address del 192.168.2.255/23 dev bond0


1.36 70265 RECOVERY ACTIONS BANNED FOR MANAGED OBJECT

Probable cause: Software Error



Meaning

An operator has set the specified managed object to an inert mode. The managed object identifies a node. If the inert mode is set for the whole cluster, this alarm is raised separately for each node. While the inert mode is on, high availability services (HAS) does not attempt to recover services from failures, for example, by restarting nodes or applications, or by performing switchovers within the specified managed objects. Note that the inert mode should be used only by qualified supplier's representatives when analysing problems in the system.

The inert mode is switched on by issuing an fshascli command, for example:

$ fshascli --inert-mode on /CLA-0

The command above switches the inert mode on for the /CLA-0 node. Accordingly, the inert mode can be switched off by using the fshascli command:

$ fshascli --inert-mode off /CLA-0

This alarm is raised when an operator switches the inert mode on for either a set of nodes or the cluster. The inert mode has the following effects on the behaviour of the system in nodes for which the inert mode has been switched on:

• If there are no failures, the service provided by the network element is not affected.

• If failures occur, no recovery actions are performed and the service may be affected. For example, if a process fails, it is not restarted by HAS.

• Process failures are still propagated to the recovery unit level, but the recovery unit level fault recovery does not take place. In practice, this means that the propagated process failure does not cause restarts of other recovery unit processes, and switchovers do not take place with active/standby recovery groups.

• HAS logs pending recovery actions to master syslog (/var/log/master-syslog on the active CLA node) in the form "INFO Inert mode set for <managed object name>. Recovery action \"restart\" pending.".

• HAS does not raise any alarms for managed objects in the inert mode. The inert mode for a node sets all managed objects within the node to the inert mode.

• The inert mode persists in the nodes over node or cluster restarts.

• Only the node and cluster restart, power on and power off fshascli commands work while the inert mode is set for the nodes or the cluster.

Note that fault recovery works in a normal way in the nodes that are not in the inert mode.
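Because pending recovery actions are only logged while the inert mode is on, it can be useful to list them before switching the inert mode off. The following Python sketch extracts them from syslog lines that follow the message form quoted above. The function name is hypothetical, and anything preceding "INFO" on a real syslog line (timestamp, host, process) is not specified here, so the pattern simply ignores it.

```python
import re

# Matches the message form quoted in this section. The optional backslashes
# cover the case where the quotes appear escaped in the log text.
PENDING = re.compile(
    r'Inert mode set for (?P<mo>.+?)\. '
    r'Recovery action \\?"(?P<action>[^"\\]+)\\?" pending\.'
)


def pending_actions(lines):
    """Return (managed object, recovery action) pairs found in the given lines."""
    found = []
    for line in lines:
        match = PENDING.search(line)
        if match:
            found.append((match.group("mo"), match.group("action")))
    return found
```

The lines would typically be read from /var/log/master-syslog on the active CLA node, for example with open("/var/log/master-syslog") as log: pending_actions(log).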




Instructions

1. To ensure proper functionality of the system, switch off the inert mode after the problem analysis is done.

2. You can switch off the inert mode from all nodes of the cluster by issuing the fshascli command:

$ fshascli --inert-mode off /

Note that this should be done by the supplier's field engineer who is currently analysing the system.

When the inert mode is switched off, pending recovery actions take place. For example, if an important severity process in a cold active/standby recovery group has failed in a node that was in the inert mode, switching the inert mode off for the node causes a switchover of the recovery group.

Clearing

The system clears the alarm when the inert mode is switched off from the managed object.

Testing instructions

1. Switch the inert mode on for the cluster:

$ fshascli --inert-mode on /

An alarm should be raised for all present nodes of the cluster.

2. Switch the inert mode off for the cluster:

$ fshascli --inert-mode off /

The alarm should be cancelled for all present nodes of the cluster.


1.37 70267 EXTERNAL USER ACCOUNT VALIDATION FAILED

Probable cause: Configuration or Customising Error



Meaning

Network Element (NE) has detected that according to the NetAct Remote User Information Management (RUIM) LDAP (Lightweight Directory Access Protocol) access control lists, an external user account defined in NetAct LDAP user database has permissions for this NE. According to the NE security architecture, remote user accounts are replicated locally. The validation check performed before the replication for the user account did not pass and therefore the user account was not replicated.

Possible reasons for a failing validation check are:

1. External username is the same as one of the NE internal usernames. This should not happen if NetAct is following the agreed way of naming users.
2. External username is a reserved username.
3. External username is invalid, for example, too long (supported usernames are up to 31 characters long).
4. External username contains invalid characters.
5. Account is not assigned with any valid permissions.
6. External user ID is the same as one of internal user IDs.
7. External user ID is not in the supported range.
8. Some permissions do not map to any valid groups.
9. User ID is not a valid number.

The user account cannot be used to log into the NE (except for case 8 above, where user is still able to log in).

Identifying additional information fields

Username

Additional information fields

error type (1-9 according to the list in "Meaning of alarm")

uid (numeric user ID). Note that in case of error type 9, the user ID in this field is set to -1

comma-separated list of invalid group names (for error type 8)

Instructions

Check that the username complies with the restrictions imposed by the NE and correct the account information in NetAct LDAP.

The restrictions (based on /RUIMFLEXI/) are the following:

• the username must be created according to [a-zA-Z0-9_.][a-zA-Z0-9_-.]{0,30}[a-zA-Z0-9_.$-]? (32 characters maximum)

• the username cannot start with one of the prefixes reserved for network elements: "_nok", "_nsn"

• the username cannot be the same as one of the reserved names from the list (defined in /RUIMFLEXI/): root, wheel, daemon, adm, sync, shutdown, halt, lp, mail, uucp, operator, games, nobody, gopher, nfs, nfsnobody, named, ntp, ldap, mysql, postgres, apache, sshd, rpm, dbus, vcsa, nscd

• the numeric user ID of a RUIM user must be in the range of [1000, 9999999], that is, greater than or equal to one thousand and less than ten million.

• the account must be assigned with at least one valid permission. Valid permissions are those that allow mapping an external user account to one or more network element groups.
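The restrictions listed above can be checked before an account is created in NetAct LDAP. The Python sketch below is an illustration only and not the NE's validation code. It assumes that the username pattern is meant as three consecutive character classes (with the hyphen written last in the middle class so that the expression is a valid regular expression), and it models permissions as a plain list. The function name and the wording of the returned messages are hypothetical.

```python
import re

# Pattern from the restrictions list: 1 leading character, up to 30 middle
# characters, and an optional trailing character (32 characters maximum).
USERNAME_RE = re.compile(r"[a-zA-Z0-9_.][a-zA-Z0-9_.-]{0,30}[a-zA-Z0-9_.$-]?")
RESERVED_PREFIXES = ("_nok", "_nsn")
RESERVED_NAMES = {
    "root", "wheel", "daemon", "adm", "sync", "shutdown", "halt", "lp", "mail",
    "uucp", "operator", "games", "nobody", "gopher", "nfs", "nfsnobody",
    "named", "ntp", "ldap", "mysql", "postgres", "apache", "sshd", "rpm",
    "dbus", "vcsa", "nscd",
}
UID_MIN, UID_MAX = 1000, 9999999


def check_account(username: str, uid: int, permissions: list) -> list:
    """Return a list of violated restrictions (empty if the account looks valid)."""
    problems = []
    if not USERNAME_RE.fullmatch(username):
        problems.append("username does not match the allowed pattern")
    if username.startswith(RESERVED_PREFIXES):
        problems.append("username starts with a reserved prefix")
    if username in RESERVED_NAMES:
        problems.append("username is a reserved name")
    if not UID_MIN <= uid <= UID_MAX:
        problems.append("user ID is outside the supported range")
    if not permissions:
        problems.append("account has no valid permission")
    return problems
```

The sketch does not cover the checks that need NE data, such as collisions with NE internal usernames or user IDs, or the mapping of permissions to network element groups.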

Clearing

Clear the alarm with an alarm management application (for example, Alarm Browser) after correcting the fault as presented in Instructions.

Testing instructions

The test setup must include an external LDAP server supporting the RUIM schema (defined in /RUIMSCHEMA/).

Before you start, check that:

• FlexiPlatform cluster is commissioned and up.
• NE account is defined in the NE's internal LDAP (NWI3 Security fragment).
• External LDAP server is up.
• All RUIM-related RGs (RuimRep and PAP) are unlocked and enabled.

1. Create a user account in external LDAP in a way that conflicts with the restrictions described in the Meaning of the alarm section.

2. Make this user a member of an LDAP ACL that is linked with ruiAuthObject that defines a valid permission in the network element. For example, ruiAuthObject and ruiAuthOperation:

dn: ruiAuthObjectName=fsui,ou=SystemPermissionsSet,ou=NetAct,ou=Authorization,ou=ruim,ou=region-911080,ou=regions,ou=NetAct,dc=noklab,dc=net
ruiIsStereoType: FALSE
ruiAuthObjectName: fsui
objectClass: top
objectClass: ruiAuthorizationObject
ruiMgmtDomain: ALL

dn: ruiAuthOperationName=monitor,ruiAuthObjectName=fsui,ou=SystemPermissionsSet,ou=NetAct,ou=Authorization,ou=ruim,ou=region-911080,ou=regions,ou=NetAct,dc=noklab,dc=net
ruiIsScopeDependent: FALSE
objectClass: top
objectClass: ruiAuthorizedOperation
ruiClassification:
ruiAuthOperationName: monitor

You can construct the group name _nokfsuimonitor by applying the rule "_nok"+ruiAuthObject+ruiAuthOperation. Making a user a member of this group gives it permissions FSNASVIEW, FSIPVIEW, FSLBVIEW, FSLANVIEW, and so on.

3. Initiate an ssh login using the created account.

4. Observe that the alarm is raised and check that the user is not replicated to the NE's internal LDAP RUIM cache fragment (fsFragmentId=security-ruim-cache,fsClusterId=ClusterRoot). Login is not successful.


5. Clear the alarm manually.
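The group naming rule used in step 2 ("_nok"+ruiAuthObject+ruiAuthOperation) can be expressed as a small Python helper, for example to derive the expected group name from the DN of a ruiAuthOperation entry. This is a sketch with hypothetical function names. The DN parsing is deliberately simple and assumes that the DN contains no escaped commas.

```python
def ruim_group_name(auth_object: str, auth_operation: str) -> str:
    """Apply the rule "_nok" + ruiAuthObject + ruiAuthOperation."""
    return "_nok" + auth_object + auth_operation


def group_name_from_dn(dn: str) -> str:
    """Derive the group name from the DN of a ruiAuthOperation entry."""
    attributes = {}
    for rdn in dn.split(","):
        name, sep, value = rdn.strip().partition("=")
        # Keep the first occurrence of each attribute (ou appears several times).
        if sep and name not in attributes:
            attributes[name] = value
    return ruim_group_name(attributes["ruiAuthObjectName"],
                           attributes["ruiAuthOperationName"])
```

With the example entries of step 2, the object name fsui and the operation name monitor give the group name _nokfsuimonitor.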


1.38 70268 EXTERNAL LDAP FAILURE

Probable cause: Underlying resource unavailable



Meaning

A network element (NE) experiences problems with the connection to the NetAct external Lightweight Directory Access Protocol (LDAP) server. The alarm is raised for the following types of problems:

1. Both primary and secondary NetAct LDAP servers are down, unreachable, not responding within a certain time, or replying with a return code indicating that LDAP is busy. This indicates a failure.

2. The NE account is accepted by neither the primary nor the secondary NetAct LDAP server. This case affects the long-term functionality. Note that if the NE is configured to fall back to the initial registration account when the NE account is invalid, an alarm in this case does not indicate inability of the NE to use NetAct LDAP, because the initial registration account is used temporarily.

3. Neither the NE account nor the initial registration account is accepted by either the primary or the secondary NetAct LDAP server. This indicates a failure.

4. Bad LDAP data (for example, loops in referrals, too big a result set).

The NE is trying to contact the external NetAct LDAP server in the following scenarios:

1. The NE connects to the NetAct LDAP server to verify an external user's password information.

2. The NE connects to the NetAct LDAP server to obtain an external user's authorisation data. There are several use cases when this scenario is triggered:

a) User authorisation data is fetched and replicated locally during the first login of an external user into the NE or a login occurring after the replicated user account is removed from the NE's internal user database due to cache expiry. This scenario occurs after the external user's password has been verified in the context of user authentication.

b) User authorisation data replication is triggered by the NE Name Service Switch (NSS) module, for example by using the id command.

c) User authorisation data is fetched and replicated after a command line interface (CLI) command using the fsruimrepcli tool is issued. Please see the user manual page for fsruimrepcli for details.

d) User authorisation data is fetched and replicated due to a scheduled cache update. Scheduled cache updates are performed by the RuimRepServer process of the RuimRep recovery group automatically and regularly, with the time interval between replications configured by the following property in the RuimRepServer property file in /opt/Nokia_BP/SS_AAA/etc/:

// automatic cache refresh interval in seconds
ruim.replicator.refresh_interval

Problem 2 does not prevent any scenario from completing successfully, but an alarm is raised if it occurs. Problems 1 and 3 prevent successful completion of all scenarios, with the following effects:

• In scenarios 1 and 2a, the external user's login is denied with the appropriate PAM (Pluggable Authentication Module) error code.

• In scenario 2b, there can be various problems related to user-to-group mappings for external users.

• In scenario 2c, no alarm is raised (CLI).

• In scenario 2d, the RuimRepServer process of the RuimRep recovery group performs a time-based replication automatically and regularly, with the time interval between replications configured by the following property in the RuimRepServer property file in /opt/Nokia_BP/SS_AAA/etc/:

// automatic cache refresh interval in seconds
ruim.replicator.refresh_interval

If the time-based replication fails due to the NetAct LDAP server unavailability (Problem 1), the RuimRepServer process starts to recover from the failure by retrying the replication according to the following properties:

// retry count incase of cache refresh failure
ruim.replicator.refresh_retry_count
// sleep between cache refresh tries in seconds
ruim.replicator.refresh_retry_interval

If LDAP is still not available after ruim.replicator.refresh_retry_count retries, RuimRepServer switches to the secondary NetAct LDAP server. An alarm is raised if the secondary NetAct LDAP server is not available either.

• In case of problems 2 and 3, during time-based replication, RuimRepServer raises an alarm without retrying.

While RuimRepServer is experiencing problems with external LDAP, access control changes made in NetAct may not be propagated to the NE. For example, if there are users logged in, their sessions still operate with permissions based on the last successfully replicated data.
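The retry behaviour described for scenario 2d can be sketched in Python as follows. This is a simplified model and not the RuimRepServer implementation. It assumes one initial attempt followed by refresh_retry_count retries against the primary server, and a single attempt against the secondary server, because the text does not say whether the secondary server is retried. The function and parameter names are hypothetical.

```python
import time
from typing import Callable


def replicate_with_retry(try_primary: Callable[[], bool],
                         try_secondary: Callable[[], bool],
                         retry_count: int,
                         retry_interval_s: float,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Return "primary", "secondary" or "alarm" for one time-based replication.

    retry_count      -- models ruim.replicator.refresh_retry_count
    retry_interval_s -- models ruim.replicator.refresh_retry_interval
    """
    if try_primary():
        return "primary"
    for _ in range(retry_count):
        # Sleep between cache refresh tries.
        sleep(retry_interval_s)
        if try_primary():
            return "primary"
    # Assumption: the secondary server is tried once after the switch.
    if try_secondary():
        return "secondary"
    return "alarm"
```

Under these assumptions the alarm is raised roughly retry_count times retry_interval_s seconds after the failed replication starts, which is the delay observed in execution scenario 1 of the testing instructions.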

Identifying additional information fields

1. problem type (1 - NetAct LDAP not available, 2 - NE account not usable, 3 - both NE account and initial registration accounts not usable, 4 - Bad data)

2. scenario (1 - PAM or NSS failure, 2 - RuimRepServer replication)

Additional information fields

LDAP error code

number of retries (applicable for scenario with time-based replication (2d))

retry interval (in seconds as defined by the RuimRepServer properties)

Instructions

Depending on the problem type (see Identifying Application Additional Info) the cause for the problem can be:

• Network configuration problems.
Check that the primary and secondary NetAct LDAP server addresses defined in the active configuration fragment under NWI3 Mediator fragment in internal LDAP server (fsFragmentId=mediator,fsFragmentId=NWI3,fsClusterId=ClusterRoot) are reachable with the ping command.

• NE account expired or deleted in NetAct.
Check that the NE account stored in the internal LDAP server in the NWI3 security fragment configuration in LDAP (fsFragmentId=security,fsFragmentId=NWI3,fsClusterId=ClusterRoot) exists also in the primary NetAct LDAP servers and has not expired.


• NetAct generated a new NE account, but the NE did not receive it.
Check that the NE account username and password stored in the NE's internal LDAP fragment (fsFragmentId=security,fsFragmentId=NWI3,fsClusterId=ClusterRoot) are the same as the ones stored in the NetAct LDAP server.

• NetAct LDAP is overloaded or shut down.

Clearing

The alarm is automatically cleared by the RuimRepServer when replication is successful. The alarm is also cleared when a new alarm with the same specific problem but with different Identifying Application Additional Info is raised by RuimRepServer.

Testing instructions

The test setup must include an external LDAP server populated according to the NetAct remote user information management (RUIM) schema (/RUIMSCHEMA/).

Before you start, check that:

• FlexiPlatform cluster is commissioned and up.
• the connection with the external LDAP is established.
• all RUIM-related RGs (RuimRep and PAP) are unlocked and enabled.

Execution scenario 1:

1. Shut both the primary and secondary NetAct LDAP servers down.

2. Wait until the time-based replication is triggered according to the RuimRepServer property.

// automatic cache refresh interval in seconds
ruim.replicator.refresh_interval

3. Observe the time delay before the alarm is raised. Alarm additional info must indicate the problem correctly. The delay is due to retry logic in the RuimRepServer controlled by the properties.

// retry count incase of cache refresh failure
ruim.replicator.refresh_retry_count
// sleep between cache refresh tries in seconds
ruim.replicator.refresh_retry_interval

4. Start the Primary NetAct LDAP server.

5. Wait until the time-based replication is triggered.

6. Observe that the alarm is automatically cleared by RuimRepServer.

☞ Configure the time-based replication to be frequent enough for testing purposes. For this scenario, set ruim.replicator.refresh_interval to a desired value and restart RuimRep RG.


Execution scenario 2:

1. Delete the account currently used by the NE from both the primary and secondary NetAct LDAP servers. Make sure that the initial registration account exists and is correctly configured in the NE.

2. Initiate an ssh login with an external account.

3. Observe that the login is successful and an alarm is raised. Alarm additional info must indicate the problem correctly.



Execution scenario 3:

1. Delete the account currently used by the NE from both the primary and secondary NetAct LDAP servers. Make sure that the initial registration account does not exist.

2. Initiate an ssh login with an external account.

3. Observe that the login is denied and an alarm is raised. Alarm additional info must indicate the problem correctly.


1.39 70269 INVALID ACTIVE SESSIONS

Probable cause: Database inconsistency


Default severity: Critical

Meaning

Currently there are open sessions to the Network Element (NE) that operate according to outdated authorisation profiles. This situation occurs when there are changes in NetAct Lightweight Directory Access Protocol (LDAP) affecting those external Remote User Information Management (RUIM) user accounts (or permissions associated with those accounts) which were replicated into the NE's local user database.

The change can be one of the following:

• The user account has been removed from NetAct. • The user account cannot be used to access the NE anymore. • The permissions associated with this account have changed in NetAct.

Currently there are active user sessions that were opened before the above-mentioned changes were detected in the NE. Within those already created user sessions, access control changes do not take effect automatically. Users logged in with affected user accounts still continue to operate with the old permission set.

Note that only sessions maintained in /var/run/utmp are monitored. Currently only SSH sessions are monitored. FTP sessions opened with vsftpd are also visible in /var/run/utmp, but FTP sessions are not possible with external user accounts according to the platform configuration. For other types of sessions, no alarm is raised.

This alarm can indicate that some users operate within the NE with higher permissions than allowed by NetAct according to a changed user account authorisation profile. There are four possible reasons for this:

1. A non-existent user is still logged into the NE (user account removed from NetAct).

2. A user with no permissions for the NE is logged in (user account has been detached from the NE according to RUIM Access Control Lists).

3. A user has higher permissions than defined in NetAct (permissions for the user account were lowered).

4. A user has lower permissions than defined in NetAct (permissions for the user account were raised).

Note that cases 1-3 indicate a security risk.
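The four cases can be told apart with a simple comparison, sketched below in Python. The sketch is illustrative only. Modelling permissions as sets of names and the function names are assumptions, and the order of the checks follows the numbering of the list above.

```python
from typing import Optional, Set


def classify_session(exists_in_netact: bool,
                     attached_to_ne: bool,
                     session_permissions: Set[str],
                     netact_permissions: Set[str]) -> Optional[int]:
    """Return the case number (1-4) from the list above, or None if nothing changed."""
    if not exists_in_netact:
        return 1  # user account removed from NetAct
    if not attached_to_ne:
        return 2  # user account detached from the NE
    if session_permissions - netact_permissions:
        return 3  # session holds permissions that NetAct no longer grants
    if netact_permissions - session_permissions:
        return 4  # NetAct grants permissions the session does not have yet
    return None


def is_security_risk(case: Optional[int]) -> bool:
    # Cases 1-3 indicate a security risk.
    return case in (1, 2, 3)
```

In all four cases the remedy is the same: the affected sessions are closed and, if needed, reopened so that the current permissions are taken into use.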

Identifying additional information fields

username, for which changes were detected

Additional information fields

change type (user was removed or denied access to the NE (1), user's permissions changed (2))

Instructions

All currently active ssh sessions based on user accounts mentioned in the Application Additional Info field of the alarm must be closed and reopened, if needed. After reopening a session, correct permissions are taken into use, if the account is still in use for the NE.


• To check open ssh sessions:

1. Log into the active CLA.

2. Execute the following command:

# utmpdump /var/run/utmp

For example, the result of invoking utmpdump may look as follows:

# utmpdump /var/run/utmp
...
[6] [06306] [co ] [LOGIN ] [ttyS1 ] [ ] [196.144.10.0 ] [Tue Nov 14 16:20:58 2006 EET]
[7] [32610] [ts/0] [testuser] [pts/0 ] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:06:11 2006 EET]
[7] [32679] [ts/1] [testuser] [pts/1 ] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:07:07 2006 EET]
[7] [32743] [ts/2] [testuser] [pts/2 ] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:07:45 2006 EET]
[7] [00361] [ts/3] [testuser] [pts/3 ] [fle4gr01.ntc.nokia.com] [172.21.216.104 ] [Tue Nov 21 19:08:50 2006 EET]
[7] [17382] [ts/4] [root ] [pts/4 ] [flegrp13.ntc.nokia.com] [172.21.220.61 ] [Fri Dec 01 14:59:47 2006 EET]
[7] [01256] [ts/5] [extuser ] [pts/5 ] [esfleg03.ntc.nokia.com] [172.21.216.127 ] [Sun Dec 03 13:05:44 2006 EET]
[7] [04574] [ts/6] [root ] [pts/6 ] [esfleg02.ntc.nokia.com] [172.21.216.126 ] [Fri Dec 01 12:29:21 2006 EET]
...
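When several sessions are open, it can help to filter the utmpdump output by the username reported in the alarm. The following is a minimal Python sketch, not part of the product tooling: the function name sessions_for_user is hypothetical, and it assumes the bracketed eight-field layout shown in the example output above, where record type 7 marks a user process.

```python
import re

# Each utmpdump field is wrapped in square brackets.
FIELD = re.compile(r"\[([^\]]*)\]")


def sessions_for_user(utmpdump_text, username):
    """Return (pid, tty, host) for every user-process record of username."""
    found = []
    for line in utmpdump_text.splitlines():
        fields = [f.strip() for f in FIELD.findall(line)]
        if len(fields) < 8:
            continue  # not a utmp record (for example the echoed command)
        rec_type, pid, _term_id, user, tty, host = fields[:6]
        if rec_type == "7" and user == username:  # 7 = USER_PROCESS
            found.append((int(pid), tty, host))
    return found


sample = """\
[6] [06306] [co ] [LOGIN ] [ttyS1 ] [ ] [196.144.10.0 ] [Tue Nov 14 16:20:58 2006 EET]
[7] [17382] [ts/4] [root ] [pts/4 ] [flegrp13.ntc.nokia.com] [172.21.220.61 ] [Fri Dec 01 14:59:47 2006 EET]
[7] [01256] [ts/5] [extuser ] [pts/5 ] [esfleg03.ntc.nokia.com] [172.21.216.127 ] [Sun Dec 03 13:05:44 2006 EET]
"""

print(sessions_for_user(sample, "extuser"))
```

The returned PID is that of the privileged sshd process recorded in utmp; the process that actually serves the session is its child.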

The preferred way of closing a session is a graceful exit. It is, however, possible to close it forcefully. The following example illustrates a forceful cleanup of a session for user extuser.

1. First, check the sshd process ID of the child process of 01256:

# ps -ef | grep 1256
root 1256 7701 0 13:05 ? 00:00:00 sshd: extuser [priv]
10009 1276 1256 0 13:05 ? 00:00:00 sshd: extuser@pts/5
root 2504 17382 0 13:06 pts/4 00:00:00 grep 1256

2. Terminate the session:

# kill -9 1276

The ssh session for user extuser is terminated.

• Other instructions

The following gives some information about other types of sessions, even though they cannot be reported in this alarm.

• Authorisation handling for Element Manager over Nwi3 (secure CORBA) implies automatic refreshing of the authorisation data according to Nwi3Adapter Secure CORBA properties. To achieve faster refreshing of the authorisation data for Nwi3Adapter (for example, if you know that authorisation data for a logged-in user has changed), invoke the following command:

# fscorbaseccli -c updatetoken

This triggers the authorisation profile update for all users after at most as many seconds as specified by the property com.nokia.flexiplatform.corba.security.cache.tokenpollrefresh.pollinterval in /opt/Nokia/SS_Nwi3Adapter/etc/secfwk.properties.

• In Element Manager over HTTP, the user's permissions are checked when accessing a method (according to the default configuration), so the changed authorisation profile takes effect immediately.

• The local NE LDAP sessions are not affected. It is not possible to bind to a local NE LDAP with an external account.

Clearing

Clear the alarm with an alarm management application (for example, Alarm Browser) after correcting the fault as presented in Instructions.

Testing instructions

The test setup must include an external LDAP server populated according to the RUIM schema.

• FlexiPlatform cluster is commissioned and up.

• All RUIM-related RGs (RuimRep and PAP) are unlocked and enabled.

Execution scenario for ssh:

1. Open an ssh session to the NE using an account defined in RUIM LDAP, for example, extuser. Check with the command:

$ utmpdump /var/run/utmp

that the session is opened. You get the following entry:

[7] [01505] [ts/1] [extuser] [pts/1 ] [flegrp13.ntc.nokia.com] [172.21.220.61 ] [Sun Nov 26 19:36:26 2006 EET]

2. Remove extuser from RUIM LDAP. Execute the following CLI command:

$ fsruimrepcli --refreshcache

to enforce synchronisation between RUIM LDAP and the local replicated security fragment.

3. Observe that an alarm is displayed and that it indicates user extuser as the one for which sessions should be restarted.

4. Check that there is an sshd process corresponding to the session:

# ps -ef | grep extuser
root 1505 26013 0 19:36 ? 00:00:00 sshd: extuser [priv]
10009 1584 1505 0 19:36 ? 00:00:00 sshd: extuser@pts/1 - ssh session

5. Terminate the process:

# kill -9 1584

6. Observe that the session is terminated.

7. Try to log in again using account extuser. Access must be denied.
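Picking the right process to terminate from the ps output can be sketched as follows. This is an illustrative Python helper, not a platform tool: the name session_pids is hypothetical, and it assumes the eight-column ps -ef layout and the "sshd: user@tty" command text shown in the example above.

```python
def session_pids(ps_text, username):
    """Return PIDs of the unprivileged 'sshd: user@tty' session processes."""
    pids = []
    for line in ps_text.splitlines():
        # Columns of ps -ef: UID PID PPID C STIME TTY TIME CMD
        cols = line.split(None, 7)
        if len(cols) < 8 or not cols[1].isdigit():
            continue  # header line or unrelated text
        command = cols[7]
        # The '[priv]' parent is skipped; only the per-session child matches.
        if command.startswith("sshd: " + username + "@"):
            pids.append(int(cols[1]))
    return pids


sample = """\
root 1505 26013 0 19:36 ? 00:00:00 sshd: extuser [priv]
10009 1584 1505 0 19:36 ? 00:00:00 sshd: extuser@pts/1
root 2504 17382 0 19:37 pts/4 00:00:00 grep extuser
"""

print(session_pids(sample, "extuser"))
```

Matching on the command text rather than on the UID column keeps the sketch independent of how external accounts are mapped to numeric user IDs.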

1.40 70280 UNKNOWN SPECIFIC PROBLEM

Probable cause: Configuration or customising error


Default severity: 5 Warning

Meaning

This alarm indicates detection of an alarm notification for a specific problem (alarm number) that is unknown to the alarm system (the corresponding alarm type isn't defined in the reference data).

The unknown specific problem can be the result of either using a dynamic alarm type (a type that isn't inherently predefined and correspondingly not ported to the alarm system) or a mistake due to a missing import of the existing alarm definition in the alarm system.

The alarm system creates a new type of alarm on the fly, using data from the alarm notification to set the alarm type parameters.

This alarm type is stored persistently in the alarm system database reference data and applied to subsequent new alarm notifications that contain the specific problem in question. As a result, alarm 70280 is no longer raised for the recently registered specific problem.


1. Unknown specific problem in the original alarm notification.

Instructions

1. The alarm either announces the use of a dynamic alarm type in alarm notification or indicates an undefined alarm in the alarm system (the exact reason can be identified by checking the list of known alarms in the customer documentation). In the latter case contact your Nokia Siemens Networks representative to upgrade the system with the missing alarm definition.

The alarm system creates a new alarm type using the following values for its parameters:

A. Static Parameters:

Parameter Value

B. Dynamic Parameters:

Parameter          Value
Alarm text         the value of a special field in the alarm notification; if the field isn't defined, the text takes the form "ALARM NNN", where NNN is the specific problem in question.
Probable cause     0 (INDETERMINATE).
Event type         Environmental.
Specific problem   the specific problem in question.
Clearing info      automatic clearing.
Default severity   the perceived severity of the alarm notification; if it isn't set, the INDETERMINATE value is used.

If required, the static parameters can be changed with an alarm management application.
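The fallback rules for the alarm text and the default severity can be illustrated with a short Python sketch. It is only a model of the behaviour described above: the function name build_alarm_type and the dictionary keys are hypothetical and do not correspond to the actual alarm system interfaces.

```python
def build_alarm_type(notification):
    """Derive alarm type parameters from a notification (illustrative only).

    The keys 'specific_problem', 'alarm_text' and 'perceived_severity'
    are assumptions made for this sketch.
    """
    specific_problem = notification["specific_problem"]
    # Missing alarm text falls back to the form "ALARM NNN".
    alarm_text = notification.get("alarm_text") or "ALARM {}".format(specific_problem)
    # Missing perceived severity falls back to INDETERMINATE.
    severity = notification.get("perceived_severity") or "INDETERMINATE"
    return {
        "specific_problem": specific_problem,
        "alarm_text": alarm_text,
        "default_severity": severity,
    }


print(build_alarm_type({"specific_problem": 79999}))
print(build_alarm_type({"specific_problem": 79999,
                        "alarm_text": "TEST ALARM",
                        "perceived_severity": "MAJOR"}))
```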



1. Raise any unknown test alarm with the flexalarm tool, for instance 79999:

# flexalarm --raise --sp=79999 --mo=/ --ap=/CLA-0/TestRU/TestApp --se=3

2. Using an alarm management application, observe that a new alarm type with the parameters described in the Instructions field has been added to the alarm system reference data.

3. Observe that alarm 70280 has also been raised.

Autoacknowledgment       yes, if the attribute defined by fsParameterId=fsAutoAckedDAT, fsAlarmProcessorConfigurationId=Default, fsAlarmProcessorId=AlarmProcessor1, fsFragmentId=AlarmProcessors, fsFragmentId=AlarmMgmt, fsClusterId=ClusterRoot in the alarm system LDAP configuration is set to "true"; otherwise no.
Switch over update       no
Clearing delay           0.
Informing delay          0.
Time to live             0.
Operation Instructions   "Not defined".

1.41 71000 PM FTP CONNECTION FAILED

Probable cause: Communication Protocol Error



Meaning

A file transfer operation failed when trying to download a measurement file. The IP address in the additional information field indicates which interface the problem concerns.

This alarm is not set immediately after a file transfer operation fails, but only after file transfers to the same IP address have failed consecutively over the duration defined by the LDAP parameter OMS/OMSRNC/SS_RNCPM/OMSMeaHandler/BTSFTPAlarmSetDelay.
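The delayed alarm setting can be modelled as tracking, per IP address, when the current run of consecutive failures started. The following Python sketch is only an illustration of that idea: the class name FtpFailureTracker is hypothetical, and it assumes the delay is expressed in seconds, which the text above does not state.

```python
class FtpFailureTracker:
    """Tracks consecutive transfer failures per IP address (illustrative)."""

    def __init__(self, alarm_set_delay_s):
        # Assumed to play the role of BTSFTPAlarmSetDelay, in seconds.
        self.delay = alarm_set_delay_s
        self.first_failure = {}  # ip -> time of first failure in the current run

    def record(self, ip, ok, now):
        """Record one transfer result; return True if the alarm should be set."""
        if ok:
            # A successful transfer ends the run of consecutive failures.
            self.first_failure.pop(ip, None)
            return False
        start = self.first_failure.setdefault(ip, now)
        return now - start >= self.delay


tracker = FtpFailureTracker(alarm_set_delay_s=600)
print(tracker.record("10.0.0.1", ok=False, now=0))
print(tracker.record("10.0.0.1", ok=False, now=600))
```

Keeping a separate start time per address reflects the condition that the failures must concern the same IP address.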

Measurement data may be lost or delayed.

Identifying additional information fields

1. IP-address of FTP/HTTP/HTTPS server

Additional information fields

2. Cause information: "Connect_failed", "Get_failed", "Other_error"

3. Network element identifier ("WBTS-xxx" for BTS-failures, "OMU" for OMU failures)

Instructions

Normally the alarm does not need to be cleared, because the system cancels the alarm automatically when the file transfer operation is successful. However, if the related network element is removed altogether from the network or its IP-address is changed, it may be necessary to cancel the alarm manually using Element Manager.

Clearing

Do not clear the alarm. The system cancels the alarm automatically.

1.42 71001 MEASUREMENT DATA NOT TRANSFERRED

Probable cause: Queue Size Exceeded

Meaning

The number of files waiting to be transferred to NetAct has exceeded a defined threshold.

Some measurement data may not have been transferred to NetAct or file transfer acknowledgements from NetAct to OMS are not working correctly.



Instructions

The alarm system clears the alarm when the number of untransferred files decreases below a defined threshold. If the NetAct connection is to be disabled, the alarm is cancelled automatically within 10 minutes after setting the LDAP parameter PMFileBufferAlarmEnabled to value 0 (zero).
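The set and clear conditions for this alarm can be pictured as a simple threshold check on the number of pending files. The Python sketch below is illustrative only: the function name and both threshold values are hypothetical, and whether the product uses one threshold or two separate ones is not stated here.

```python
def next_alarm_state(active, pending_files, set_threshold, clear_threshold):
    """Return the new alarm state for the current number of pending files."""
    if not active and pending_files > set_threshold:
        return True   # queue has exceeded the set threshold
    if active and pending_files < clear_threshold:
        return False  # queue has decreased below the clear threshold
    return active     # otherwise the state is unchanged


state = False
for pending in (10, 120, 90, 40):
    state = next_alarm_state(state, pending, set_threshold=100, clear_threshold=50)
    print(pending, state)
```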

Clearing

Do not clear the alarm. The system cancels the alarm automatically.

1.43 71002 MEASUREMENT DATA ERROR

Probable cause: Corrupt data

Meaning

A measurement file could not be processed.

Some measurement data could have been lost due to invalid measurement file content.


Additional information fields

1. Error info, possible values: "Decompression_failed", "File_corrupted", "Other_failure"

2. File name

3. IP-address of data provider

4. Detailed error code for troubleshooting

Instructions

No user actions are required.

Clearing

Do not clear the alarm. The alarm system clears the alarm automatically.

1.44 71003 OMS MEASUREMENT DATA PROCESSING OVERLOAD

Probable cause: System Resources Overload

Meaning

The time used for processing performance measurement data in OMS has exceeded the defined limit. This does not necessarily indicate any loss of measurement data, but the measurement parameters should be changed to decrease load and prevent possible problems caused by the overload.

The limits used to set and cancel this alarm can be changed by the user from OMS LDAP parameters.

Too much measurement data is produced in the RNC and OMS overload causes a risk for losing some data.

Identifying additional information fields

1. Measurement category, possible values: "RNW_meas", "Transm_hw_meas", "WBTS_meas"


Instructions

No user actions are required.

Clearing

Do not clear the alarm. The alarm system clears the alarm after the data processing load has decreased to a normal level.

1.45 71005 THRESHOLD MONITORING LIMIT EXCEEDED

Probable cause: Threshold Crossed



Meaning

Threshold monitoring makes it easier to detect faults, identify bottlenecks and optimise the network. Using Element Manager GUI, appropriate performance thresholds are determined for each important variable, and exceeding these thresholds indicates a problem worthy of attention. These variables can be either single counters or Key Performance Indicators (KPIs), which can be a combination of several counters.

When performance data is gathered on variables of interest from the measured objects in the network, their values are compared against any active threshold limits. When a performance threshold is exceeded, an alarm is generated and sent to the network management system. In addition to this, a threshold event log is saved in the network element for further study of the events which have occurred in that NE during a certain period of time.

When this alarm has been triggered, it means that a threshold monitoring rule has been evaluated as true by OMS. The object of this alarm is always OMS, even if the threshold rule is targeted at some other measured object. The real object of the threshold alarm and more information on the event can be seen with the NE Threshold Management application.

The effect of this alarm depends on the operator-defined threshold rule that triggered it.

Identifying additional information fields

1. Measurement type

2. Threshold rule name

Instructions

A threshold alarm does not necessarily mean that there are problems in the network element, because thresholds can be freely set by the operator, and some rules may have been set incorrectly.

To get further information on the reason of the threshold alarm:

- Connect to the network element with Element Manager.

- Open the NE Threshold Management application.

- Select "Show Threshold Log" from the View menu and check the threshold log.

When the target object and other information of the alarm have been checked from the log, you can obtain more information from the performance counters of measurements and, if necessary, take appropriate action to correct the problem. The counters can be browsed either by using Element Manager applications (NE Measurement Explorer or RNW Measurement Presentation) or by using NetAct reporting tools.

Clearing

Do not clear the alarm. This alarm is cancelled automatically by the system after 15 seconds.

1.46 71006 WCEL THRESHOLD MONITORING LIMIT EXCEEDED

Probable cause: Threshold Crossed

Meaning

When this alarm has been triggered, it means that a threshold monitoring rule has been evaluated as true by OMS for some WCDMA cell object in the Cell Resource measurement. More information on the event can be seen with the NE Threshold Management application in OMS Element Manager.

Threshold monitoring makes it easier to detect faults, identify bottlenecks and optimise the network. Using OMS Element Manager GUI, appropriate performance thresholds are determined for each important variable, and exceeding these thresholds indicates a problem worthy of attention. These variables can be either single counters or Key Performance Indicators (KPIs), which can be a combination of several counters.










When the target object and other information of the alarm have been checked from the log, you can obtain more information from the performance counters of measurements and, if necessary, take appropriate action to correct the problem. The counters can be browsed with the RNW Measurement Presentation application or by using NetAct reporting tools.

Clearing

Do not clear the alarm. This alarm has a lifetime of 65 minutes.

1.47 71007 MEASUREMENT THRESHOLD MONITORING LIMIT EXCEEDED

Probable cause: Threshold Crossed

Meaning

When this alarm has been triggered, it means that a threshold monitoring rule has been evaluated as true by OMS for some WCDMA cell object in an RNW measurement other than the Cell Resource measurement. Threshold limit breaks in the Cell Resource measurement are reported with alarm WCEL THRESHOLD MONITORING LIMIT EXCEEDED. More information on the event can be seen with the NE Threshold Management application in OMS Element Manager.

Threshold monitoring makes it easier to detect faults, identify bottlenecks and optimise the network. Using OMS Element Manager GUI, appropriate performance thresholds are determined for each important variable, and exceeding these thresholds indicates a problem worthy of attention. These variables can be either single counters or Key Performance Indicators (KPIs), which can be a combination of several counters.










When the target object and other information of the alarm have been checked from the log, you can obtain more information from the performance counters of measurements and, if necessary, take appropriate action to correct the problem. The counters can be browsed with the RNW Measurement Presentation application or by using NetAct reporting tools.

Clearing

Do not clear the alarm. This alarm has a lifetime of 65 minutes.

1.48 71050 OMS EMT CONNECTION COULD NOT BE OPENED

Probable cause: Communication Protocol Error



Meaning

A new EMT connection could not be opened from OMS to the network element.

The application using the EMT connection will not work properly, because opening the communication between the application and the network element does not succeed. The application will work properly after the EMT connection succeeds.

Identifying additional information fields

1. IP address of the failed target

Instructions

1. Check OMS EMT configuration as described in Integrating RNC OMS document.

2. Verify that OMS is reachable from the NE using ping from NE to OMS (MML command ZQRX).

3. Check OMS syslog.

4. Verify cabling between RNC OMU, ESA ethernet switch and OMS.

If none of the mentioned steps helps, collect OMS syslog and contact the local Nokia Siemens Networks representative.

Clearing

Do not clear the alarm. The system cancels the alarm automatically after its lifetime has elapsed.

1.49 71051 OMS EMT CONTROL CONNECTION FAILURE

Probable cause: Communication Protocol Error

Meaning

OMS maintains a control connection to each network element that it manages. One of those connections has failed.

The application using the EMT connection does not work properly, because the connection between the application and the network element has been broken.



Instructions

The error can have many different causes (for example, a network failure, a configuration error such as a faulty IP address, lack of memory, excessive load, a network element reset, or any network-related failure in the network element).

See Troubleshooting RNC OMS for how to fix the problem. If that does not provide a solution, contact the local Nokia Siemens Networks representative.

Clearing

Do not clear the alarm. The system cancels the alarm automatically after its lifetime has elapsed.

1.50 71052 OMS FILE TRANSFER CONNECTION COULD NOT BE OPENED

Probable cause: Communication Protocol Error

Meaning

Starting a new file transfer connection has failed.

File transfer between OMS and the target network element is not working.

Additional information fields

2. URL of the failed target

Instructions

The error can have many different causes (for example, a configuration error such as a faulty IP address, lack of memory, or excessive load). To find out the reason for the error:

1. Open web browser.

2. Go to page https://<OMS IP address>/

3. Select "Element Manager Login" and open Log Viewer.

4. Check the log for errors.

If the problem persists, see Troubleshooting RNC OMS document for how to fix the problem. If that does not provide a solution, contact the local Nokia Siemens Networks representative.

Clearing

The alarm is cleared automatically by the alarm system after its time to live has expired. This alarm has a lifetime of 10 minutes.

1.51 71053 O&M SUPPORT FOR INTEGRATED 3RD PARTY DEVICES

Probable cause: Communication Protocol Error

Meaning

This alarm is created by the O&M Support for the Integrated 3rd Party Devices feature. Some monitored IP address or port is down.

If an IP address is down, it cannot be reached, and if the device is, for example, a switch, none of the devices “under” this switch can be reached either. If a port goes down, any active device connected to that port changes state to non-active or becomes disconnected.

Identifying additional information fields

1. IP address (and port)

Instructions

Check why the IP address (and port) has failed and fix the failure according to the guidelines of the monitored device's manufacturer. The “O&M Support for Integrated 3rd Party Devices” feature automatically cancels the alarm when the failure is fixed.

Clearing

Do not cancel the alarm. The system cancels the alarm automatically.

1.52 71054 WCDMA BTS O&M MEDIATION FAILURE

Probable cause: Communication Protocol Error

Meaning

There is an NWI3 connection problem between RNC and NetAct. This alarm is set by OMS when sending a WBTS O&M operation reply from OMS to NetAct fails.

In case of an NWI3 problem, the WBTS O&M mediation tasks done by the OMS unit cannot be performed (SW download, SW version upload, HW configuration upload).



Instructions

After the problem in the NWI3 connection has been corrected, the system cancels the alarm only after the next O&M mediation event is sent to NetAct successfully. Thus it is normal behaviour that the alarm stays active for a while after the problem has been corrected.

Clearing

Do not clear the alarm. This alarm is cancelled automatically by the system.

1.53 71055 NETWORK ELEMENT RESTARTED

Probable cause: Indeterminate

Meaning

OMS has received an indication about an imminent network element restart.

While this alarm is active, the event flow from the network element to NetAct is not working. After the network element restart is over, the event flow works again.



Instructions

No user actions are required.

Clearing

This is an informative alarm and will be cleared automatically by the alarm system after its time to live has expired.

1.54 71057 RNW NOTIFICATION MISSING

Probable cause: Communication Protocol Error

Meaning

The RNC (Radio Network Controller) sends notifications to the NMS when the radio network database has been updated. Some notifications related to the RNW database could not be sent from the RNC to the NMS. The reason is a buffer overflow in the RNC, or a notification handling error in the RNC or in the NMS. The buffer is in the Operation and Maintenance Unit (OMU) of the RNC. The alarm is set from OMS.

There might be incoherent information in the NMS about the RNC radio network database.



Instructions

Upload the information related to the NWI3 fragment in question from the NMS to get the radio network information up-to-date.

If the alarm is set again shortly after being cancelled, there may be an error in the event buffering system in the OMU.

Follow these steps:

1. Shut down the /RNWEvent recovery group in the OMS with the command:

fshascli --shutdown /RNWEvent

2. Restart the EEFPRB process in the OMU with the service terminal command:

ZOG:49F,0,0,0

3. Wait until the EEFPRB has been restarted (about two minutes).

4. Unlock the /RNWEvent recovery group with the command:

fshascli --unlock /RNWEvent

5. If restarting the EEFPRB process does not help, restart the whole OMU unit and the /RNWEvent recovery group in the OMS.

6. Execute the upload operation in the NMS.

Clearing

Cancel the alarm with FM GUI after correcting the fault. See document Managing faults with RNC OMS.

1.55 71088 MMI CONNECTION FAILURE

Probable cause: Indeterminate

Meaning

The alarm is set by OMS when the OMS MML Parser library cannot connect to the RNC OMU MMI interface using SSH.

SSH connections made by MML Parser cannot be established from OMS.



Instructions

The error can have many different causes (for example, a configuration error such as a faulty IP address, lack of memory, or excessive load). To find out the reason for the error:

1. Open web browser.

2. Go to page https://<OMS IP address>/

3. Select "Element Manager Login" and open Log Viewer.

4. Check the log for errors.

If the problem persists, see Troubleshooting RNC OMS document for how to fix the problem. If that does not provide a solution, contact the local Nokia Siemens Networks representative.

Clearing

The alarm is automatically cancelled when the OMS MML Parser application can again connect to the RNC OMU MMI interface using SSH, or when 10 minutes have elapsed since the alarm was set.

1.56 71091 OVERFLOW ALARM FROM EXTERNAL SYSTEM

Probable cause: Indeterminate

Meaning

An external system has sent an alarm without a proper consecutive number to OMS. The alarm does not have a proper consecutive number because of an alarm overflow situation in the external system, and it cannot be handled in the normal way. Therefore, OMS raises a new alarm with the information received in the overflow alarm. This alarm is used to encapsulate the data of the overflowed alarm.

Note that because this alarm overflowed in the external system, it may not appear in the user interfaces of that particular system at all.

Identifying additional information fields

1. A combination field that contains the key characteristics of the overflowed alarm. This field contains the alarm number, alarm text, target object of the fault and alarm additional information. The items are combined into a single string and separated using the underscore character.

Additional information fields

2. Possible other additional information related to the overflowed alarm.
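Splitting the combination field back into its parts can be sketched in Python as follows. This is an illustration only: the function name and the sample string are hypothetical, and the sketch assumes that the alarm number, alarm text and target object do not themselves contain underscores, which is not guaranteed by the description above.

```python
def split_overflow_field(field):
    """Split the combined field into its four documented items."""
    # Split on the first three underscores only, so that underscores
    # inside the trailing additional information are preserved.
    parts = field.split("_", 3)
    parts += [""] * (4 - len(parts))  # pad if some items are missing
    number, text, target, additional = parts
    return {
        "alarm_number": number,
        "alarm_text": text,
        "target_object": target,
        "additional_info": additional,
    }


# Hypothetical field content, for illustration only.
example = "12345_EXAMPLE ALARM TEXT_RNC-1/WBTS-5_extra_info"
print(split_overflow_field(example))
```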

Instructions

See the first additional information field for the original alarm number and text, and check the documentation of the original alarm for details on how to handle the fault situation.

Clearing

Do not cancel the alarm. The system automatically clears the alarm after the defined lifetime of the alarm has elapsed.
