11gR2 Clusterware Technical Wp

7/18/2019 11gR2 Clusterware Technical Wp

http://slidepdf.com/reader/full/11gr2-clusterware-technical-wp 1/108

Oracle Clusterware 11g Release 2 (11.2)Technical White Paper

Internal / Confidential

Version 1.0 update 3



Oracle White Paper— Oracle Clusterware 11g Release 2 (11.2) Technical White Paper

1

1 Oracle Clusterware Architecture ................................................ 3

1.1 Oracle Clusterware Daemons and Agent Overview............ 3

1.2 Oracle High Availability Service Daemon (OHASD): .......... 4

1.3 Agents .............................................................................. 11

1.4 Cluster Synchronization Services (CSS): ......................... 14

1.5 Cluster Ready Services (CRS): ........................................ 35

1.6 Grid Plug and Play (GPnP) .............................................. 40

1.7 Oracle Grid Naming Service (GNS):................................. 46

1.8 Grid Interprocess Communication .................................... 52

1.9 Cluster time synchronization service daemon (CTSS): ..... 55

1.10 mdnsd .............................................................................. 57

2 Voting Files and Oracle Cluster Repository Architecture .......... 58

2.1 Voting File in ASM............................................................ 58

2.2 Voting File Changes ......................................................... 58

2.3 Oracle Cluster Registry (OCR) ......................................... 60

2.4 Oracle Local Registry (OLR) ............................................ 61

2.5 Bootstrap and Shutdown if OCR is located in ASM .......... 62 2.6 OCR in ASM diagnostics .................................................. 62

2.7 The ASM Diskgroup Resource ......................................... 63

2.8 The Quorum Failure Group .............................................. 64

2.9 ASM spfile ........................................................................ 65

3 Resources ............................................................................... 66

3.1 Resource types ................................................................ 66

3.2 Resource Dependencies .................................................. 77

4 Fast Application Notification (FAN) .......................................... 80

4.1 Event Sources .................................................................. 80

4.2 Event Processing architecture in oraagent ....................... 80




2

5 Configuration best practices ..................................................... 84

5.1 Cluster interconnect ......................................................... 84

5.2 misscount ......................................................................... 84

6 Clusterware Diagnostics and Debugging ................................. 85

6.1 Check Cluster Health ....................................................... 85

6.2 crsctl command line tool ................................................... 86

6.3 Trace File Infrastructure and Location .............................. 87

6.4 OUI / SRVM / JAVA related GUI tracing ........................... 89

6.5 Reboot Advisory ............................................................... 89

7 Other Tools .............................................................................. 91

7.1 ocrpatch ........................................................................... 91

7.2 vdpatch ............................................................................ 91

7.3 Appvipcfg – adding an application VIP ............................. 96

7.4 Application and Script Agent ............................................ 97

7.5 Oracle Cluster Health Monitor - OS Tool (IPD/OS) ......... 100

8 Appendix ................................................................................ 106




3

Oracle Clusterware 11g Release 2 (11.2)

Introduction

The transition to Oracle Database 11g release 2 (11.2), Oracle Clusterware introduced a

wide array of changes, ranging from a complete redesign of CRSD, introduction of a "local

CRS" (OHASD) and replacing the RACG layer with a tightly integrated AGENT layer, to new

features such as Grid Naming Service, Grid Plug and Play, Cluster time synchronization

service and Grid IPC. Cluster synchronization service (CSS) is probably the layer that seems

least affected by the 11.2 changes, but it provided functionality to support the new features,

as well as added new functionality such as IPMI support.

With this technical paper, we wanted to take the opportunity to provide all the know-how

that we have accumulated over the 11.2 development years, and relay it to everyone else

who is just starting out learning Oracle Clusterware 11.2. The paper provides conceptual

overviews as well as detailed information related to diagnostics and debugging.

Since this is the first version of the Oracle Clusterware diagnostics paper, not all components

of the 11.2 stack are covered equally detailed. If you feel you can make a contribution to the

paper, please let us know.

Disclaimer

The information contained in this document is subject to change without notice. If you find

any problems in this paper, or have any comments, corrections or suggestions, please report

them to us via E-Mail (mailto:[email protected]). We do not warrant that this

document is error-free. No part of this document may be reproduced in any form or by any

means, electronic or mechanical, for any purpose, without the permission of the authors.

This document is for internal use only and may not be distributed outside of Oracle.

1 Oracle Clusterware Architecture

This section will describe the main Oracle Clusterware daemons.

1.1 Oracle Clusterware Daemons and Agent Overview

The diagram below is a high level overview about the daemons, resources and agents used

in Oracle Clusterware 11g release (11.2).




4

The first big change noticed between pre-11.2 and 11.2 is the new OHASD daemon, which is

replacing all the known init scripts which exist in pre-11.2.

#

Figure 1: Resource startup chart.

1.2 Oracle High Availability Service Daemon (OHASD):

Oracle Clusterware consists of two separate stacks. The upper stack anchored by the Cluster

Ready Services daemon (crsd) and a lower stack anchored by the Oracle High Availability

Services daemon (ohasd). These two stacks have several processes that facilitate cluster

operations. The following chapters will describe them in detail.




5

The OHASD is the daemon which starts every other daemon that is part of the Oracle

Clusterware stack on a node. OHASD will replace all the pre-11.2 existing init scripts.

The entry point for OHASD is /etc/inittab, which executes the /etc/init.d/ohasd and/etc/init.d/init.ohasd control scriptsThe /etc/init.d/ohasd script is a RC script including the

start and the stop actions. The /etc/init.d/init.ohasd script is the OHASD framework control

script which will spawn the Grid_home/bin/ohasd.bin executable.

The cluster control files are located in /etc/oracle/scls_scr/<hostname>/root (this is the

location for Linux) and maintained by crsctl; in other words, a “crsctl enable / disable crs”

will update the files in this directory.

# crsctl enable -h

Usage:

crsctl enable crs

Enable OHAS autostart on this server

# crsctl disable –h

Usage:

crsctl disable crs

Disable OHAS autostart on this server

The content of the file scls_scr/<hostname>/root/ohasdstr controls the autostart of the CRS

stack; the two possible values in the file are “enable” – autostart enabled, or “disable” –

autostart disabled.

The file scls_scr/<hostname>/root/ohasdrun controls the init.ohasd script. The three

possible values are “reboot” – sync with OHASD, “restart” – restart crashed OHASD, “stop” –

scheduled OHASD shutdown.

The big benefit of having OHASD in Oracle Clusterware 11g release 2 (11.2) is the ability to

run certain crsctl commands in a clusterized manner. Clusterized commands are completely

operating system independent, as they only rely on ohasd. If ohasd is running, then remote

operations, such as the starting, stopping, and checking the stack status of remote nodes,

can be performed.

Clusterized commands include the following:

– crsctl check cluster

– crsctl start cluster

– crsctl stop cluster

There are more functions that OHASD is performing, such as processing and managing the

Oracle Local Repository (OLR), as well as acting as the OLR server. In a cluster, OHASD runs

as root; in an Oracle Restart environment, where OHASD manages application resources, it

runs as the oracle user.




6

1.2.1 OHASD Resource Dependency

The clusterware stack in Oracle Clusterware 11g release 2 (11.2) is started by the OHASD

daemon, which itself is spawned by the script /etc/init.d/init.ohasd when a node is started.

Alternatively, ohasd is started on a running node with ‘crsctl start crs’ after a prior ‘crsctl

stop crs’. The OHASD daemon will then start other daemons and agents. Each Clusterware

daemon is represented by an OHASD resource, stored in the OLR. The chart below shows

the association of the OHASD resources / Clusterware daemons and their respective agent

processes and owner.

Resource Name Agent Name Owner

ora.gipcd oraagent crs user

ora.gpnpd oraagent crs user

ora.mdnsd oraagent crs user

ora.cssd cssdagent Root

ora.cssdmonitor cssdmonitor Root

ora.diskmon orarootagent Root

ora.ctssd orarootagent Root

ora.evmd oraagent crs user

ora.crsd orarootagent Root

ora.asm oraagent crs user

ora.driver.acfs orarootagent Root

ora.crf (new in 11.2.0.2) orarootagent root

Figure 2: Resource and Agent associated table.




7

The picture below shows all the resource dependencies between OHASD managed resources

/ daemons:

MDNSD GIPCD

GPNPD

CSSDMONITORDISKMON CSSD

CTSSD

EVMDCRSD

START:weakSTART:weakSTOP:hard(intermediate)

START:weak

START:hard

STOP:hard(intermediate)

START:weak

STOP:hard

STOP:hardSTART:hard , pullup

STOP:hard(intermediate)START:hard , pullup

START:hard , pullup

STOP:hard (intermediate)START:hard , pullup

START:hard , pul lup

START:weak(concurrent),pullup(always)

Figure 3: For details regarding the hard/weak and pullup/intermediate resource dependencies see to 3.2.




8

1.2.2 Daemon Resources

A typical daemon resource list from a node is listed below. To get the daemon resources list

we need to use the –init flag with the crsctl command.

# crsctl stat res -init –t

--------------------------------------------------------------------------NAME

TARGET STATE SERVER STATE_DETAILS

--------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------

ora.asm

1 ONLINE ONLINE node1 Started

ora.crsd

1 ONLINE ONLINE node1

ora.cssd


ora.cssdmonitor


ora.ctssd

1 ONLINE ONLINE node1 OBSERVER

ora.diskmon


ora.drivers.acfs


ora.evmd


ora.gipcd


ora.gpnpd


ora.mdnsd


The list below will show the types used, and the hierarchy. Everything is built on the base

“resource” type. The cluster_resource is using the “resource” type as base type. Using the

cluster_type as base type we build the ora.daemon.type which is the building block for e.g.

the ora.cssd.type and all the other daemon resources.

To print the “internal” resource type names and resources use the crsctl –init flag.

# crsctl stat type -init




9

TYPE_NAME=application

BASE_TYPE=cluster_resource

TYPE_NAME=cluster_resource

BASE_TYPE=resource

TYPE_NAME=local_resource

BASE_TYPE=resource

TYPE_NAME=ora.asm.type

BASE_TYPE=ora.daemon.type

TYPE_NAME=ora.crs.type


TYPE_NAME=ora.cssd.type


TYPE_NAME=ora.cssdmonitor.type


TYPE_NAME=ora.ctss.type


TYPE_NAME=ora.daemon.type


TYPE_NAME=ora.diskmon.type


TYPE_NAME=ora.drivers.acfs.type


TYPE_NAME=ora.evm.type


TYPE_NAME=ora.gipc.type


TYPE_NAME=ora.gpnp.type


TYPE_NAME=ora.mdns.type


TYPE_NAME=resource

BASE_TYPE=




10

Using the ora.cssd resource as an example, all the ora.cssd attributes can be shown using

crsctl stat res ora.cssd –init –f (note, not all the attributes are listed in the below example,

onyl the most important one).

# crsctl stat res ora.cssd -init –f

NAME=ora.cssd

TYPE=ora.cssd.type

STATE=ONLINE

TARGET=ONLINE

ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:oracle11:r-x

AGENT_FILENAME=%CRS_HOME%/bin/cssdagent%CRS_EXE_SUFFIX%

CHECK_INTERVAL=30

ocssd_PATH=%CRS_HOME%/bin/ocssd%CRS_EXE_SUFFIX%

CSS_USER=oracle11

ID=ora.cssd

LOGGING_LEVEL=1

START_DEPENDENCIES=weak(ora.gpnpd,concurrent:ora.diskmon)hard(ora.cssdmonitor)

STOP_DEPENDENCIES=hard(intermediate:ora.gipcd,shutdown:ora.diskmon)

In order to debug daemon resources the –init flag must be used all the time. To enable

additional debugging for e.g. ora.cssd:

# crsctl set log res ora.cssd:3 -init

To check a log level:

# crsctl get log res ora.cssd –init

Get Resource ora.cssd Log Level: 3

To check resource properties like logging level run:

# crsctl stat res ora.cssd -init -f | grep LOGGING_LEVEL

DAEMON_LOGGING_LEVELS=

LOGGING_LEVEL=3




11

1.3 Agents

Oracle Clusterware 11g Release 2 (11.2) introduces a new agent concept which makes the

Oracle Clusterware more robust and performant. These agents are multi-threaded daemonswhich implement entry points for multiple resource types and which spawn new processes

for different users. The agents are highly available and besides the oraagent, orarootagent

and cssdagent/cssdmonitor, there can be an application agent and a script agent.

The two main agents are the oraagent and the orarootagent. Both ohasd and crsd employ

one oraagent and one orarootagent each. If the CRS user is different from the ORACLE user,

then crsd would utilize two oraagents and one orarootagent.

1.3.1 oraagent

ohasd’s oraagent:

– Performs start/stop/check/clean actions for ora.asm, ora.evmd, ora.gipcd,ora.gpnpd, ora.mdnsd

crsd’s oraagent:

– Performs start/stop/check/clean actions for ora.asm, ora.eons, ora.LISTENER.lsnr,

SCAN listeners, ora.ons

– Performs start/stop/check/clean actions for service, database and diskgroup

resources

– Receives eONS events, and translates and forwards them to interested clients

(eONS will be removed and its functionality included in EVM in 11.2.0.2)

– Receives CRS state change events and dequeues RLB events and enqueues HAevents for OCI and ODP.NET clients

1.3.2 orarootagent

ohasd’s orarootagent:

– Performs start/stop/check/clean actions for ora.crsd, ora.ctssd, ora.diskmon,

ora.drivers.acfs, ora.crf (11.2.0.2)

crsd’s orarootagent:

– Performs start/stop/check/clean actions for GNS, VIP, SCAN VIP and network

resources

1.3.3 cssdagent / cssdmonitor

Please refer to the chapter “cssdagent and cssdmonitor”.




12

1.3.4 Application agent / scriptagent

Please refer to the chapter “application and scriptagent”.

1.3.5 Agent Log Files

The log files for the ohasd/crsd agents are located in

Grid_home/log/<hostname>/agent/{ohasd|crsd}/<agentname>_<owner>/<agentname>_<o

wner>.log. For example, for ora.crsd, which is managed by ohasd and owned by root, the

agent log file is named

Grid_home/log/<hostname>/agent/ohasd/orarootagent_root/orarootagent_root.log

The same agent log file can have log messages for more than one resource, if those

resources are managed by the same daemon.

If an agent process crashes,

– a core file will be written toGrid_home/log/<hostname>/agent/{ohasd|crsd}/<agentname>_<owner>, and

– a call stack will be written to

Grid_home/log/<hostname>/agent/{ohasd|crsd}/<agentname>_<owner>/<agentna

me>_<owner>OUT.log

The agent log file format is the following:

<timestamp>:[<component>][<thread id>]…

<timestamp>:[<component>][<thread id>][<entry point>]…

Example:

2009-10-07 13:25:00.640: [ora.ctssd][2991836048] [check] In code translate, return

= 0, state detail = OBSERVER

2009-10-07 13:25:08.545: [ AGFW][2991836048] check for resource: ora.diskmon 1

1 completed with status: ONLINE

2009-10-07 13:25:18.231: [ora.crsd][2991836048] [check] DaemonAgent::check

returned 0

2009-10-07 13:25:18.231: [ora.crsd][2991836048] [check] CRSD Deep Check

If any error occurs, the entry points for determining what happened are:

– clusterware alert log file Grid_home/log/<hostname>/alert<hostname>.log

– OHASD/CRSD log file

Grid_home/log/<hostname>/ohasd/ohasd.log

Grid_home/log/<hostname>/crsd/crsd.log




13

– The corresponding agent log file.

Bear in mind that one agent log file will contain the start/stop/check for multiple

resources. Taking the crsd orarootagent as example in case of a SCAN VIP failure,

grep for e.g. the resource name “ora.scan2.vip”.

2009-11-25 06:20:24.766: [ora.scan2.vip] [check] Checking if IP 10.137.12.214 is

present on NIC eth0

2009-11-25 06:20:24.766: [ AGFW] check for resource: ora.scan2.vip 1 1

completed with status: ONLINE

2009-11-25 06:20:25.765: [ AGFW] CHECK initiated by timer for: ora.scan2.vip 1

1

2009-11-25 06:20:25.767: [ AGFW] Executing command: check for resource:

ora.scan2.vip 1 1

2009-11-25 06:20:25.768: [ora.scan2.vip] [check] Checking if IP 10.137.12.214 is

present on NIC eth0

2009-11-25 06:20:25.768: [ AGFW] check for resource: ora.scan2.vip 1 1

completed with status: ONLINE

2009-11-25 06:20:26.767: [ AGFW] CHECK initiated by timer for: ora.scan2.vip 1

1

2009-11-25 06:20:26.769: [ AGFW] Executing command: check for resource:

ora.scan2.vip 1 1






15

reconfig manager thread writes an eviction notification (a.k.a. a kill block)

to the voting files. The RMT also sends a shutdown message to the victim.

Voting file heartbeats are monitored for split-brain checking, and remote

nodes are not considered gone until their disk-heartbeats have ceased for

<misscount> seconds.

– Discovery thread – for voting file discovery

– Fencing thread – for communicating with the diskmon process for I/O fencing, if

EXADATA is used.

1.4.2 Voting File Cluster Membership Threads

– Disk Ping thread – (one per voting file)

o writes the current view of cluster membership along with an incarnation

number and incrementing sequence number to voting file with which it is

associated, and

o Reads the kill block to see if its host node has been evicted.

o This thread also monitors the voting-disk heartbeat for remote nodes. The

disk heartbeat information is used during reconfigurations in order to

determine whether a remote ocssd has terminated.

– Kill Block thread – (one per voting file) monitors voting file availability to ensure a

sufficient number of voting files are accessible. If Oracle redundancy is used, werequire the majority of the configured voting disks online.

– Worker thread – (new in 11.2.0.1, 1 per voting file) miscellaneous I/O to voting files

– Disk Ping Monitor – monitors I/O voting file status

o This thread watches to ensure that disk ping threads are correctly reading

their kill blocks on a majority of the configured voting files. If we can’t

perform I/O to the voting file(s) due to I/O hang or I/O failures or other

reasons, we take the voting file(s) offline. This thread monitors the progress

of the disk ping threads. If CSS is unable to read a majority of the voting

files, it is possible that it no longer shares access to at least one disk with

each other node. It would be possible for this node to miss an eviction

notice; in other words, CSS is not able to cooperate and must be

terminated.




16

1.4.3 Other Threads – Occasionally

– Node Kill threads – (transient) used for killing nodes via IPMI

–

Member kill thread – (transient) used during member kill

o member-kill (monitor) thread

o local-kill thread - when a CSS client initiates a member kill, the local CSS kill

thread will be created

– skgxn monitor (skgxnmon only present with vendor clusterware)

o This thread registers as a member of the node group with skgxn and

watches for changes in node-group membership. When a reconfig event

occurs, this thread requests the current node-group membership bitmap

from skgxn and compares it to the bitmap it received last time and the

current values of two other bitmaps: eviction pending, which identifies

nodes that are in the process of going down, and VMON’s group

membership, which indicates nodes whose oclsmon process is still running

(nodes that are (still) up). When a membership transition is identified, the

node-monitor thread initiates the appropriate action.

1.4.4 Other CSS trivia

In Oracle Clusterware 11g release 2 (11.2) there are diminished configuration requirements,

meaning nodes are added back automatically when started and deleted if they have been

down for too long. Unpinned servers that stop for longer than a week are no longer

reported by olsnodes. These servers are automatically administered when they leave the

cluster, so you do not need to explicitly remove them from the cluster.

1.4.4.1 Pinning nodes

The appropriate command to change the node pin behavior (i.e. to pin or unpin any specific

node), is the crsctl pin/unpin css command. Pinning a node means that the association of a

node name with a node number is fixed. If a node is not pinned, its node number may

change if the lease expires while it is down. The lease of a pinned node never expires.

Deleting a node with the crsctl delete node command implicitly unpins the node.

– During upgrade of Oracle Clusterware, all servers are pinned, whereas after a fresh

installation of Oracle Clusterware 11g release 2 (11.2), all servers you add to the

cluster are unpinned.

– You cannot unpin a server that has an instance of Oracle RAC that is older than

Oracle Clusterware 11g release 2 (11.2) if you installed Oracle Clusterware 11g

release 2 (11.2) on that server.




17

Pinning a node is required for rolling upgrade to Oracle Clusterware 11g release 2 (11.2) and

will be done automatically. We have seen cases where customers perform a manual

upgrade and this would fail due to unpinned nodes.

1.4.4.2 Port assignment

The fixed port assignment for the CSS and node monitor has been removed, so there should

be no contention with other applications for ports. The only exception is during rolling

upgrade where we assign two fixed ports.

1.4.4.3 GIPC

The CSS layer is using the new communication layer Grid IPC (GIPC) and it still supports the

interaction with the pre-11.2 CLSC communication layer. In 11.2.0.2, GIPC will support the

use of multiple NICs for a single communications link, e.g. CSS/NM internode

communications.

1.4.4.4 Cluster alert.log

More cluster_alert.log messages have been added to allow faster location of entries

associated with a problem. An identifier will be printed in both the alert.log and the daemon

log entries that are linked to the problem. The identifier will be unique within the

component, e.g. CSS or CRS.

2009-11-24 03:46:21.110

[crsd(27731)]CRS-2757:Command 'Start' timed out waiting for response from the

resource 'ora.stnsp006.vip'. Details at (:CRSPE00111:) in

/scratch/grid_home_11.2/log/stnsp005/crsd/crsd.log.

2009-11-24 03:58:07.375

[cssd(27413)]CRS-1605:CSSD voting file is online: /dev/sdj2; details in

/scratch/grid_home_11.2/log/stnsp005/cssd/ocssd.log .

1.4.4.5 Exclusive mode

A new concept in Oracle Clusterware 11g release 2 (11.2) is the clusterware exclusive mode.

This mode will allow you to start the stack on one node with virtually nothing required to

start the stack. No voting files are required and no network connectivity is required. This

mode is for maintenance or trouble shooting only. Because this is a user invoked command

the user should be sure that only one node is up at the same time. The command to start

the stack in this mode is crsctl start crs –excl which only the root user can execute and

should run from one node only.

If another node is already up in the cluster, the exclusive start up will fail. The ocssd daemon

checks for active nodes and if it finds one, the start up will fail with CRS-4402. This is not an




18

error; this is an expected behaviour when another node is already up. John Leys said “do

not file bugs because you receive CRS-4402”.

1.4.4.6 Voting file discovery

The method of identifying voting files has changed in 11.2. While voting files were

configured in OCR in 11.1 and earlier, in 11.2 voting files are located via the CSS voting file

discovery string in the GPNP profile. Examples:

1.4.4.6.1 CSS voting file discovery string referring to ASM

The CSS voting file discovery string refers to ASM, so it will be using the value in the ASM

discovery string. Most commonly you will see this configuration on systems (e.g. Linux, using

older 2.6 kernels) where raw devices can still be configured, and where raw devices are used

for the LUN’s to be used by CRS and ASM.

Example:

<orcl:CSS-Profile id="css"

DiscoveryString=”+asm"

LeaseDuration="400"/>

<orcl:ASM-Profile id="asm" DiscoveryString="" SPFile=""/>

The empty value for the ASM discovery string means that it will revert to an OS-specific

default, which on Linux is “/dev/raw/raw*”.

1.4.4.6.2 CSS voting file discovery string referring to list of LUN’s/disks

In the example below, the CSS voting file discovery string actually refers to a list of disks /LUN’s. This is likely the configuration when block devices or devices in non-default locations

are used. In that scenario, the values for the CSS VF discovery string and the ASM discovery

string are identical.

<orcl:CSS-Profile id="css"

DiscoveryString=”/dev/shared/sdsk-a[123]-*-part8"

LeaseDuration="400"/>

<orcl:ASM-Profile id="asm"

DiscoveryString=”/dev/shared/sdsk-a[123]-*-part8

SPFile=""/>

Several voting file identifiers must be found on a disk to accept it as a voting disk: a unique

identifier for the file, the cluster GUID and a matching configuration incarnation number

(CIN). vdpatch can be used (vdpatch) to inspect a device whether it is a voting file.




19

1.4.5 CSS lease

Lease acquisition is a mechanism through which a node acquires a node number. A lease

denotes that a node owns the associated node number for a period defined by the lease

duration. A lease duration is hardcoded in the GPNP profile to be one week. A node owns

the lease for the lease duration from the time of last lease renewal. A lease is considered to

be renewed during every DHB. Hence a lease expiry is defined

as below - lease expiry time = last DHB time + lease duration.

There are two types of lease.

– Pinned leases

A node uses a hard coded static node number. A pinned lease is used in an upgrade

scenario which involves older version clusterware that use static node number.

– Unpinned leases

A node acquires a node number dynamically using a lease acquisition algorithm.Lease acquisition algorithm is designed to resolve conflicts among nodes which try

to acquire the same slot at the same time.

For a successful lease operation the below message is put into the

Grid_home/log/<hostname>/alert<hostname>.log,

[cssd(8433)]CRS-1707:Lease acquisition for node staiv10 number 5 completed

For a lease acquisition failure, an appropriate message is also put in the <alert>hostname.log

and the ocssd.log. In the current release there are no tunable to tune the lease duration.

1.4.6 Split Brain Resolution

The below chapter will describe the main components and techniques used to resolve split

brain situations.

1.4.6.1 Heartbeats

The CSS uses two main heartbeat mechanisms for cluster membership, the network

heartbeat (NHB) and the disk heartbeat (DHB). The heartbeat mechanisms are intentionally

redundant and they are used for different purposes. The NHB is used for the detection of

loss of cluster connectivity, whereas the DHB is mainly used for network split brain

resolution. Each cluster node must participate in the heartbeat protocols in order to be

considered a healthy member of the cluster.

1.4.6.1.1 Network Heartbeat (NHB)

The NHB is sent over the private network interface that was configured as private

interconnect during Clusterware installation. CSS sends a NHB every second from one node




20

to all the other nodes in a cluster and receive every second a NHB from the remote nodes.

The NHB is also sent to the cssdmonitor and the cssdagent.

The NHB contains time stamp information from the local node, and is used by the remotenode to figure out when the NHB was sent. It indicates that a node can participate in cluster

activities, e.g. group membership changes, message sends etc. If the NHB is missing for

<misscount> seconds (30 seconds in Linux 11.2), a cluster membership change (cluster

reconfiguration) is required. The loss of connectivity to the network is not necessarily fatal if

the network connectivity is restored in less then <misscount> seconds.

To debug NHB issues, it is sometimes useful to increase the ocssd log level to 3 to see each

heartbeat message. Run the crsctl set log command as root user on each node:

# crsctl set log css ocssd:3

Monitor the largest misstime value in milliseconds to see if the misscount is increasing,

which would indicate network problems.

# tail -f ocssd.log | grep -i misstime

2009-10-22 06:06:07.275: [ ocssd][2840566672]clssnmPollingThread: node 2,

stnsp006, ninfmisstime 270, misstime 270, skgxnbit 4, vcwmisstime 0, syncstage 0

2009-10-22 06:06:08.220: [ ocssd][2830076816]clssnmHBInfo: css timestmp

1256205968 220 slgtime 246596654 DTO 28030 (index=1) biggest misstime 220 NTO

28280





28290





28290

To display the value of the current misscount setting use the command crsctl get css

misscount . We do not support a misscount setting other than the default; for customerswith more stringent HA requirements, contact Support / Development.




21

1.4.6.1.2 Disk Heartbeat (DHB)

Apart from the NHB, we use the DHB which is required for split brain resolution. It contains a

timestamp of the local time in UNIX epoch seconds, as well as a millisecond timer.

The DHB beat is the definitive mechanisms to make a decision about whether a node is still

alive. When the DHB beat is missing for too long, the node is assumed to be dead. When

connectivity to the disk is lost for 'too long', the disk is considered offline.

The definition about ‘too long’ depends for the DHB on the following circumstances. First of

all, the Long disk I/O Timeout (LIOT), which has a default setting from 200 seconds. If we

cannot finish an I/O within that time to a voting file, we will take this voting file offline.

Secondly, the Short disk I/O Timeout (SIOT), which CSS uses during a cluster reconfiguration.

The SIOT is related to misscount (misscount (30) – reboottime (3) = 27 sec.). The default

reboottime is 3 seconds. To display the value of the disktimeout parameter for CSS, use the

command, crsctl get css disktimeout.

1.4.6.2 Network Split Detection

The timestamp of the last NHB is compared to the timestamp of the most recent DHB to

determine if a node is still alive.

When the delta between the timestamps of the most recent DHB and the last NHB is greater

than the SIOT (misscount – reboottime), a node is considered still active.

When the delta between the timestamps is less than reboottime, the node is considered still

alive.

If the time that the last DHB was read is more than SIOT, the node is considered dead (see

bug 5949311) .If the delta between the timestamps is greater than reboottime and less than SIOT, the

status of the node is unclear, and we must wait to make a decision until we fall into one of

the three above categories.

When the network fails and nodes that are still up cannot communicate with each other, the

network is considered split. To maintain data integrity when a split occurs, one of the nodes

must fail and the surviving nodes should be an optimal sub-cluster of the original cluster.

Nodes that are not to survive are evicted via one of the three possible ways:

– Via an eviction message sent through the network. In most cases this will fail

because of the existing network failure.

– via the voting file, the kill block

– via IPMI, if supported and configured

To explain this in more detail we use the following example for a cluster with nodes A, B, C

and D:




22

– Nodes A and B receive each other's heartbeats

– Nodes C and D receive each other's heartbeats

– Nodes A and B cannot see heartbeats of C or D

– Nodes C and D cannot see heartbeats of A or B

– Nodes A and B are one cohort, C and D are another cohort

– Split begins when 2 cohorts stop receiving NHB’s from each other

CSS assumes a symmetric failure, i.e. the cohort of A+B stops receiving NHB’s from the

cohort of C+D at the same time that C+D stop receiving NHB’s from A+B.

In scenarios like this, CSS uses the voting file and DHB for split brain resolution. The kill

block, which is one part of the voting file structure, will be updated and used to notify nodes

that they have been evicted. Each node is reading its kill block every second, and will commit

suicide after another node has updated this kill block section.

In cases like the above, where we have similar sized sub-clusters, the sub-cluster with the

node containing the lower node number will survive and the other sub-cluster nodes will

reboot.

In case of a split in a larger cluster, the bigger sub-cluster will survive. In the two-node

cluster case, the node with the lower node number will survive in case of a network split,

independent from where the network error occurred.

The connectivity to a majority of voting files required for a node to stay active.

1.4.7 Member Kill ArchitectureThe kill daemon in 11.2.0.1 is an unprivileged process that kills members of CSS groups. It is

spawned by the ocssd library code when an I/O capable client joins a group, and it is

respawned when required. There is ONE kill daemon (oclskd) per user (e.g. crsowner,

oracle).

1.4.7.1 Member kill description

The following ocssd threads are involved in member kill / member kill escalation:

– client_listener – receives group join and kill requests

– peer_listener – receives kill requests from remote nodes

– death_check – provides confirmation of termination

– member_kill – spawned to manage a member kill request

– local_kill – spawned to carry out member kills on local node




23

– node termination – spawned to carry out escalation

Member kills are issued by clients who want to eliminate group members doing IO, for

example:

– LMON of the ASM instance

– LMON of a database instance

– crsd on Policy Engine (PE) master node (new in 11.2)

Member kills always involve a remote target; either a remote ASM or database instance, or

a remote, non-PE master crsd. The member kill request is handed over to the local ocssd,

who then sends the request to ocssd on the target node. In 11.1 and 11.2.0.1, ocssd will

hand over the process id's of the primary and shared members of the group to be killed to

oclskd. The oclskd will then perform a kill -9 on these processes. In 11.2.0.2 and later, the killdaemon runs as a thread in the cssdagent and cssdmonitor processes, hence there will no

running oclskd.bin process anymore. The kill daemon / thread register with CSS separately in

the KILLD group.

In some situations, and more likely in 11.2.0.1 and earlier, such as extreme CPU and memory

starvation, the remote node's kill daemon or remote ocssd cannot service the local ocssd’s

member kill request in time (misscount seconds), and therefore the member kill request will

time out. If LMON (ASM and/or RDBMS) requested the member kill, then the request will be

escalated by the local ocssd to a remote node kill. A member kill request by crsd will never

be escalated to a node kill, instead, we rely on the orarootagent's check action to detect the

dysfunctional crsd and restart it. The target node's ocssd will receive the member killescalation request, and will commit suicide, thereby forcing a node reboot.

With the kill daemon running as real-time thread in cssdagent/cssdmonitor (11.2.0.2),

there's a higher chance that the kill request succeeds despite high system load.

If IPMI is configured and functional, the ocssd node monitor will spawn a node termination

thread to shutdown the remote node using IPMI. The node termination thread

communicates with the remote BMC via the management LAN; it will establish an

authentication session (only a privileged user can shutdown a node) and check the power

status. The next step is requesting is a power-off and repeatedly checking the status until

the node status is OFF. After receiving the OFF status, we will power-ON the remote node

again, and the node termination thread will exit.

1.4.7.2 Member kill example:

LMON of database instance 3 issuing a member kill for instance on node 2 due to CPU

starvation:




24

2009-10-21 12:22:03.613810 : kjxgrKillEM: schedule kill of inst 2 inc 20

in 20 sec

2009-10-21 12:22:03.613854 : kjxgrKillEM: total 1 kill(s) scheduled

kgxgnmkill: Memberkill called - group: DBPOMMI, bitmap: 1

2009-10-21 12:22:22.151: [ CSSCLNT]clssgsmbrkill: Member kill request: Members map

0x00000002

2009-10-21 12:22:22.152: [ CSSCLNT]clssgsmbrkill: Success from kill call rc 0

The local ocssd (third node, internal node number 2) receives the member kill request:

2009-10-21 12:22:22.151: [ ocssd][2996095904]clssgmExecuteClientRequest: Member

kill request from client (0x8b054a8)

2009-10-21 12:22:22.151: [ ocssd][2996095904]clssgmReqMemberKill: Kill

requested map 0x00000002 flags 0x2 escalate 0xffffffff

2009-10-21 12:22:22.152: [ ocssd][2712714144]clssgmMbrKillThread: Kill

requested map 0x00000002 id 1 Group name DBPOMMI flags 0x00000001 start time

0x91794756 end time 0x91797442 time out 11500 req node 2

DBPOMMI is the database group where LMON registers as primary member

time out = misscount (in milliseconds) + 500ms

map = 0x2 = 0010 = second member = member 1 (other example: map = 0x7 = 0111 =

members 0,1,2)

The remote ocssd on the target node (second node, internal node number 1) receives the

request and submits the PID's to the kill daemon:

2009-10-21 12:22:22.201: [ ocssd][3799477152]clssgmmkLocalKillThread: Local

kill requested: id 1 mbr map 0x00000002 Group name DBPOMMI flags 0x00000000 st

time 1088320132 end time 1088331632 time out 11500 req node 2

2009-10-21 12:22:22.201: [ ocssd][3799477152]clssgmmkLocalKillThread: Kill

requested for member 1 group (0xe88ceda0/DBPOMMI)

2009-10-21 12:22:22.201: [ ocssd][3799477152]clssgmUnreferenceMember: global

grock DBPOMMI member 1 refcount is 7

2009-10-21 12:22:22.201: [ ocssd][3799477152]GM Diagnostics started for

mbrnum/grockname: 1/DBPOMMI

2009-10-21 12:22:22.201: [ ocssd][3799477152]group DBPOMMI, member 1 (client

0xe330d5b0, pid 23929)


0xe331fd68, pid 23973) sharing group DBPOMMI, member 1, share type normal

2009-10-21 12:22:22.201: [ ocssd][3799477152]group DG_LOCAL_POMMIDG, member 0

(client 0x89f7858, pid 23957) sharing group DBPOMMI, member 1, share type xmbr




25


0x8a1e648, pid 23949) sharing group DBPOMMI, member 1, share type normal


0x89e7ef0, pid 23951) sharing group DBPOMMI, member 1, share type normal


0xe8aabbb8, pid 23947) sharing group DBPOMMI, member 1, share type normal


(client 0x8a23df0, pid 23949) sharing group DG_LOCAL_POMMIDG, member 0, share type

normal


(client 0x8a25268, pid 23929) sharing group DG_LOCAL_POMMIDG, member 0, share type

normal


(client 0x89e9f78, pid 23951) sharing group DG_LOCAL_POMMIDG, member 0, share type

normal


(client 0xe8ab5cc0, pid 23947) sharing group DG_LOCAL_POMMIDG, member 0, share

type normal

2009-10-21 12:22:22.202: [ ocssd][3799477152]GM Diagnostics completed for

mbrnum/grockname: 1/DBPOMMI

2009-10-21 12:22:22.202: [ ocssd][3799477152]clssgmmkLocalSendKD: Copy pid

23929


23973


23957


23949


23951


23947


23949


23929


23951


23947




26

At this point, the oclskd.log should indicate the successful kill of these processes, and

thereby the completion of the kill request. In 11.2.0.2 and later, the kill daemon thread will

perform the kill:

2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsnkillagent_main:killreq

received:

2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0

pid 23929


pid 23973


pid 23957


pid 23949


pid 23951


pid 23947


pid 23949


pid 23929


pid 23951


pid 23947

However, if within (misscount + 1/2 seconds) the request doesn't complete, the ocssd on the

local node escalates the request to a node kill:

2009-10-21 12:22:33.655: [ ocssd][2712714144]clssgmMbrKillThread: Time up:

Start time -1854322858 End time -1854311358 Current time -1854311358 timeout 11500

2009-10-21 12:22:33.655: [ ocssd][2712714144]clssgmMbrKillThread: Member kill

request complete.

2009-10-21 12:22:33.655: [ ocssd][2712714144]clssgmMbrKillSendEvent: Missing

answers or immediate escalation: Req member 2 Req node 2 Number of answers

expected 0 Number of answers outstanding 1

2009-10-21 12:22:33.656: [ ocssd][2712714144]clssgmQueueGrockEvent:

groupName(DBPOMMI) count(4) master(0) event(11), incarn 0, mbrc 0, to member 2,

events 0x68, state 0x0

2009-10-21 12:22:33.656: [ ocssd][2712714144]clssgmMbrKillEsc: Escalating node




27

1 Member request 0x00000002 Member success 0x00000000 Member failure 0x00000000

Number left to kill 1

2009-10-21 12:22:33.656: [ ocssd][2712714144]clssnmKillNode: node 1 (staiu02)

kill initiated

2009-10-21 12:22:33.656: [ ocssd][2712714144]clssgmMbrKillThread: Exiting

ocssd on the target node will abort, forcing a node reboot:

2009-10-21 12:22:33.705: [ ocssd][3799477152]clssgmmkLocalKillThread: Time up.

Timeout 11500 Start time 1088320132 End time 1088331632 Current time 1088331632

2009-10-21 12:22:33.705: [ ocssd][3799477152]clssgmmkLocalKillResults: Replying

to kill request from remote node 2 kill id 1 Success map 0x00000000 Fail map

0x00000000

2009-10-21 12:22:33.705: [ ocssd][3799477152]clssgmmkLocalKillThread: Exiting

...

2009-10-21 12:22:34.679: [

ocssd][3948735392](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node 2,

sync 151438398, stamp 2440656688

2009-10-21 12:22:34.679: [

ocssd][3948735392]###################################

2009-10-21 12:22:34.679: [ ocssd][3948735392]clssscExit: ocssd aborting from

thread clssnmvKillBlockThread

2009-10-21 12:22:34.679: [

ocssd][3948735392]###################################

1.4.7.3 How to identify the client who originally requested the member kill?

From the ocssd.log, the requestor can also be derived:

2009-10-21 12:22:22.151: [ocssd][2996095904]clssgmExecuteClientRequest: Member

kill request from client (0x8b054a8)

<search backwards to when client registered>

2009-10-21 12:13:24.913: [ocssd][2996095904]clssgmRegisterClient:

proc(22/0x8a5d5e0), client(1/0x8b054a8)

<search backwards to when process connected to ocssd>

2009-10-21 12:13:24.897: [ocssd][2996095904]clssgmClientConnectMsg: Connect from

con(0x677b23) proc(0x8a5d5e0) pid(20485/20485) version 11:2:1:4, properties:

1,2,3,4,5

Using 'ps', or from other history (e.g. trace file, IPD/OS, OSWatcher), the process can be




28

identified via the process id:

$ ps -ef|grep ora_lmon

spommere 20485 1 0 01:46 ? 00:01:15 ora_lmon_pommi_3

1.4.8 Intelligent Platform Management Interface (IPMI)

Intelligent Platform Management Interface (IPMI) is an industry standard management

protocol that is included with many servers today. IPMI operates independently of the

operating system, and can operate even if the system is not powered on. Servers with IPMI

contain a baseboard management controller (BMC) which is used to communicate to the

server.

1.4.8.1 About Using IPMI for Node Fencing

To support the member-kill escalation to node-termination, you must configure and use an

external mechanism capable of restarting a problem node without cooperation, either from

Oracle Clusterware or from the operating system running on that node. IPMI is such a

mechanism, supported starting with 11.2. Normally, node termination using IPMI is

configured during installation, when the option of configuring IPMI from the Failure Isolation

Support screen is provided. If IPMI is not configured during installation, then it can be

configured using crsctl after the installation of CRS is complete.

1.4.8.2 About Node-termination Escalation with IPMI

To use IPMI for node termination, each cluster member node must be equipped with a

Baseboard Management Controller (BMC) running firmware compatible with IPMI version

1.5, which supports IPMI over a local area network (LAN). During database operation,member-kill escalation is accomplished by communication from the evicting ocssd daemon

to the victim node’s BMC over LAN. The IPMI over LAN protocol is carried over an

authenticated session protected by a user name and password, which are obtained from the

administrator during installation. If the BMC IP addresses are DHCP assigned, ocssd requires

direct communication with the local BMC during CSS startup. This is accomplished using a

BMC probe command (OSD), which communicates with the BMC through an IPMI driver,

which must be installed and loaded on each cluster system.

1.4.8.3 OLR Configuration for IPMI

There are two ways to configure IPMI, either during the Oracle Clusterware installation via

the Oracle Universal Installer or afterwards via crsctl.

OUI – asks about node-fencing via IPMI

– tests for driver to enable full support (DHCP addresses)

– obtains IPMI username and password and configures OLR on all cluster nodes




29

Manual configuration - after install or when using static IP addresses for BMCs

– crsctl query css ipmidevice

– crsctl set css ipmiadmin <ipmi-admin>

– crsctl set css ipmiaddr

See Also: Oracle Clusterware Administration and Deployment Guide, “Configuration and

Installation for Node Fencing" for more information and Oracle Grid Infrastructure

Installation Guide, “Enabling Intelligent Platform Management Interface (IPMI)”

1.4.9 Debugging CSS

Sometimes it is necessary to change the default logging level for ocssd.

The default logging level for ocssd in 11.2 is 2. In order to change the logging level, run the

following command as root user on a node with the clusterware stack up:

# crsctl set log css CSSD:N (where N is the logging level)

– Logging level 2 = default

– Logging level 3 = verbose e.g. displays each heartbeat message including the

misstime which can be helpful debugging NHB related problems

– Logging level 4 = super verbose

Most problems can be solved with level 2. Some require level 3, few require level 4. Using

level 3 or 4, trace information may only be kept for a few hours (or even minutes) because

the trace files can fill up and information can be overwritten. Please note that a high logging

level will incur a performance impact on ocssd due to the amount of tracing. If you need tokeep data for a longer period of time, create a cron job to back up and compress the CSS

logs.

In order to trace the cssdagent or the cssdmonitor the below enhanced tracing can be set

via crsctl.

# crsctl set log res ora.cssd=2 -init

# crsctl set log res ora.cssdmonitor=2 -init

In Oracle Clusterware 11g release 2 (11.2), CSS prints the stack dump into the cssdOUT.log.

There are enhancements which will help to flush diagnostic data to disk before a reboot

occurs. So in 11.2 we don’t consider it necessary to change the diagwait (default 0) unless

advised by support or development.

In very rare cases and only during debugging, it might maybe necessary to disable ocssd

reboots. This can be done via below crsctl command. Disabling reboots should only be done

when instructed by support or development and can be done online without a clusterware

stack restart.




30

# crsctl modify resource ora.cssd -attr "ENV_OPTS=DEV_ENV" -init

# crsctl modify resource ora.cssdmonitor -attr "ENV_OPTS=DEV_ENV" –init

Starting with 11.2.0.2 the possibility to set higher log levels for the individual modules is

introduced.

To list all the module names for the css daemon, the following command should be used:

# crsctl lsmodules css

List CSSD Debug Module: CLSF

List CSSD Debug Module: CSSD

List CSSD Debug Module: GIPCCM

List CSSD Debug Module: GIPCGM

List CSSD Debug Module: GIPCNM

List CSSD Debug Module: GPNPList CSSD Debug Module: OLR

List CSSD Debug Module: SKGFD

CLSF and SKGFD - are related to the I/O layer to the voting disks

CSSD - same old one

GIPCCM - gipc communication between applications and CSS

GIPCGM - communication between peers in the GM layer

GIPCNM - communication between nodes in the NM layer

GPNP - trace for gpnp calls within CSS

OLR - trace for olr calls within CSS

The following is an example in how to set the trace level different for various modules.

# crsctl set log css GIPCCM=1,GIPCGM=2,GIPCNM=3

# crsctl get log css CSSD=4

To check which trace level is currently set the following command can be used:

# crsctl get log ALL

# crsctl get log css GIPCCM

1.4.10 CSSDAGENT and CSSDMONITOR

The cssdagent and cssdmonitor provide almost the same functionality. The cssdagent(represented by the ora.cssd resource) starts, stops, and checks the status of the ocssd

daemon. The cssdmonitor (represented by the ora.cssdmonitor resource) monitors the

cssdagent. There is no ora.cssdagent resource, and there is no resource for the ocssd

daemon.




31

Both agents implement the functionality of several pre-11.2 daemons such as the oprocd,

and olcsomon; the thread that implements oclsvmon functionality, runs in either process,

not both. The cssdagent and cssdmonitor run in real-time priority with locked down

memory, just like ocssd.

In addition, the cssdagent and cssdmonitor provide the following services to guarantee data

integrity:

– Monitoring ocssd; if ocssd fails, then cssd* reboot the node.

– Monitoring the node scheduling: if node is hung / not scheduled, reboot the node.

To make more comprehensive decisions whether a reboot is required, both cssdagent and

cssdmonitor receive state information from ocssd, via NHB, to ensure that the state of the

local nodes as perceived by remote nodes is accurate. Furthermore, the integration will

leverage the time before other nodes perceive the local node to be down for purposes such

as filesystem sync to get complete diagnostic data.

1.4.10.1 CSSDAGENT and CSSDMONITOR debugging

In order to enable ocssd agent debugging, the command crsctl set log res ora.cssd:3 –init

should be used. The operation is logged in the

Grid_home/log/<hostname>/agent/ohasd/oracssdagent_root/oracssdagent_root.log and

immediate more trace information is written to the oracssdagent_root.log.

2009-11-25 10:00:52.386: [ AGFW][2945420176] Agent received the message:

RESOURCE_MODIFY_ATTR[ora.cssd 1 1] ID 4355:106099

2009-11-25 10:00:52.387: [ AGFW][2966399888] Executing command:

res_attr_modified for resource: ora.cssd 1 1

2009-11-25 10:00:52.387: [ USRTHRD][2966399888] clsncssd_upd_attr: setting trace

to level 3

2009-11-25 10:00:52.388: [ CSSCLNT][2966399888]clssstrace: trace level set to 2

2009-11-25 10:00:52.388: [ AGFW][2966399888] Command: res_attr_modified for

resource: ora.cssd 1 1 completed with status: SUCCESS

2009-11-25 10:00:52.388: [ AGFW][2945420176] Attribute: LOGGING_LEVEL for

resource ora.cssd modified to: 3

2009-11-25 10:00:52.388: [ AGFW][2945420176] config version updated to : 7 for

ora.cssd 1 1

2009-11-25 10:00:52.388: [ AGFW][2945420176] Agent sending last reply for:

RESOURCE_MODIFY_ATTR[ora.cssd 1 1] ID 4355:106099

2009-11-25 10:00:52.484: [ CSSCLNT][3031063440]clssgsgrpstat: rc 0, gev 0, incarn

2, mc 2, mast 1, map 0x00000003, not posted

The same applies for the cssdmonitor (ora.cssdmonitor) resource.




32

1.4.11 Concepts

1.4.11.1 HEARTBEATS

– Disk HeartBeat (DHB) is written to the voting file periodically, once per second

– Network HeartBeat (NHB) is sent to the other nodes periodically, once per second

– Local HeartBeat (LHB) is sent to the agent/monitor periodically, once per second

1.4.11.2 ocssd threads

– Sending Thread (ST) sends NHB’s and LHB’s (at the same time) – Disk Ping thread writes DHB’s to VF (one per VF) – Cluster Listener (CLT) receive messages from other nodes, mostly NHB’s

1.4.11.3 Agent/Monitor threads– HeartBeat thread (HBT) receives LHB from ocssd and detects connection failures

– OMON thread (OMT) monitors for connection failure and state of its local peer

– OPROCD thread (OPT) monitors scheduling of agent/monitor processes

– VMON thread (VMT) replaces clssvmon executable, registers in skgxn group when

vendor clusterware present

1.4.11.4 Timeouts

– Misscount (MC) amount of time with no NHB from a node before removing the

node from the cluster– Network Time Out (NTO) maximum time remaining with no NHB from a node

before removing the node from the cluster

– Disk Time Out (DTO) maximum time left before a majority of voting files are

considered inaccessible

– ReBoot Time (RBT) the amount of time allowed for a reboot; historically had to

account for init script latencies in rebooting. The default is 3 seconds.

1.4.11.5 Misscount, SIOT, RBT

– Disk I/O Timeout amount of time for a voting file to be offline before it is unusable

o SIOT – Short I/O Timeout, in effect during reconfig

o LIOT – Long I/O Timeout, in effect otherwise

– Long I/O Timeout – (LIOT) is configurable via ‘crsctl set css disktimeout’ and the

default is 200 seconds




33

– Short I/O Timeout (SIOT) is (misscount – reboot time) o In effect when NHB’s missed for misscount/2

o ocssd terminates if no DHB for SIOT

o Allows RBT seconds after termination for reboot to complete

1.4.11.6 Disk Heartbeat Perceptions

– Other node perception of local state in reconfig

o No NHB for misscount, node not visible on network

o No DHB for SIOT, node not alive

o If node alive, wait full misscount for DHB activity to be missing, i.e. node

not alive

– As long as DHB’s are written, other nodes must wait

– Perception of local state by other nodes must be valid to avoid data corruption

1.4.11.7 Disk Heartbeat Relevance

– DHB only read starting shortly before a reconfig to remove the node is started

– When no reconfig is impending, the I/O timeout not important, so need not be

monitored

– If the disk timeout expires, but the NHB’s have been sent to and received from

other nodes, it will still be misscount seconds before other nodes will start a

reconfig

– The proximity to a reconfig is important state information for OPT

1.4.11.8 Clocks

– Time Of Day Clock (TODC) the clock that indicates the hour/minute/second of the

day (may change as a result of commands) – aTODC is the agent TODC

– cTODC is the ocssd TODC

– Invariant Time Clock (ITC) a monotonically increasing clock that is invariant i.e. does

not change as a result of commands). The invariant clock does not change if time setbackwards or forwards; it is always constant.

o aITC is the agent ITC

o cITC is the ocssd ITC




34

1.4.12 How It Works

ocssd state information contains the current clock information, the network time out (NTO)

based on the node with the longest time since the last NHB and a disk I/O timeout based on

the amount of time since the majority of voting files was last online. The sending thread

gathers this current state information and sends both a NHB and local heartbeat to ensure

that the agent perception of the aliveness of ocssd is the same as that of other nodes.

The cluster listener thread monitors the sending thread. It ensures the sending thread has

been scheduled recently and wakes up if necessary. There are enhancements here to ensure

that even after clock shifts backwards and forwards, the sending thread is scheduled

accurately.

There are several agent threads, one is the oprocd thread which just sleeps and wakes up

periodically. Upon wakeup, it checks if it should initiate a reboot, based on the last known

ocssd state information and the local invariant time clock (ITC). The wakeup is timer driven.

The heartbeat thread is just waiting for a local heartbeat from the ocssd. The heartbeatthread will calculate the value that the oprocd thread looks at, to determine whether to

reboot. It checks if the oprocd thread has been awake recently and if not, pings it awake.

The heartbeat thread is event driven and not timer driven.

1.4.13 Filesystem Sync

When the ocssd fails, a filesystem sync is started. There is a fair amount of time to get this

done, so we can wait several seconds for a sync. The last local heartbeat indicates how long

we can wait, and the wait time is based on misscount. When the wait time expires, oprocd

will reboot the node. In most cases, diagnostic data will get written to disk. There are rare

cases when this may not possible, e.g. when the sync is not issued due to CSS being hung.




35

1.5 Cluster Ready Services (CRS):

Cluster Ready Services is the primary program for managing high availability operations in a

cluster. The CRS daemon (crsd) manages cluster resources based on the configurationinformation that is stored in OCR for each resource. This includes start, stop, monitor, and

failover operations. The crsd daemon monitors the Oracle database instance, listener, and

so on, and automatically restarts these components when a failure occurs.

The crsd daemon runs as root and restarts automatically after a failure. When Oracle

Clusterware is installed in a single-instance database environment for Oracle ASM and

Oracle Restart, ohasd instead of crsd manages application resources.

1.5.1 Policy Engine

1.5.1.1 Overview

Resource High Availability in 11.2 is handled by OHASD (usually for infrastructure resources)and CRSD (for applications deployed in the cluster). Both daemons share the same

architecture and most of the code base. For most intents and purposes, OHASD can be seen

as a CRSD in a cluster of one node. The discussion in the subsequent sections applies to both

daemons, to the extent it makes sense (“OHASD is like a CRSD in a single node cluster!”)

Since 11.2, the architecture of CRSD implements the master-slave model: a single CRSD in

the cluster is picked to be the master and others are all slaves. Upon daemon start-up and

every time the master is re-elected, every CRSD writes the current master into its crsd.log

(grep for “PE MASTER NAME”) e.g.

grep "PE MASTER" Grid_home/log/hostname/crsd/crsd.*

crsd.log:2010-01-07 07:59:36.529: [ CRSPE][2614045584] PE MASTER NAME: staiv13

CRSD is a distributed application comprised of several “modules”. Modules are mostly state-less

and operate by exchanging messages. The state (context) is always carried with each individual

message; most interactions are asynchronous in nature. Some modules have dedicated threads

others share a single thread and some operate with a pool of threads. The important CRSD

modules are as follows:

- The Policy Engine (a.k.a PE/CRSPE in logs) is responsible for rendering all policy decisions

- The Agent Proxy Server (a.k.a Proxy/AGFW in logs) is responsible for agent management

and proxy-ing commands/events between the Policy Engine and the agents

- The UI Server (a.k.a UI/UiServer in logs) is responsible for managing client connections

(APIs/crsctl), and being a proxy between the PE and client programs

- The OCR/OLR module (OCR in logs) is the front-end for all OCR/OLR interactions

- The Reporter module (CRSRPT in logs) is responsible for all event publishing out of CRSD




36

For example, a client request to modify a resource will produce the following interaction:

CRSCTL UI Server PE OCR Module PE Reporter (event publishing)

Proxy (to notify the agent)

CRSCTL UI Server PE

Note that the UiServer/PE/Proxy can each be on different nodes, as shown on Figure 4 below.

Figure 4: UiServer / PE / Proxy picture

1.5.1.2 Resource Instances & IDs

In 11.2, CRS modeling supports two concepts of resource multiplicity: cardinality and

degree. The former controls the number of nodes where the resource can run concurrently

while the latter controls the number of instances of the resource that can be run on each

node. To support the concepts, the PE now distinguishes between resources and resource

instances. The former can be seen as a configuration profile for the entire resource while the

latter represents the state data for each instance of the resource. For example, a resource

with CARDINALITY=2, DEGREE=3 will have 6 resource instances. Operations that affect

resource state (start/stopping/etc.) are performed using resource instances. Internally,

resource instances are referred to with IDs which following the following format: “<A> <B>

<C>” (note space separation), where <A> is the resource name, <C> is the degree of the

instance (mostly 1), and <B> is the cardinality of the instance for cluster_resource resources

or the name of the node to which the instance is assigned for local_resource names. That’s

why resource name have “funny” decorations in logs:

[ CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new target state: [ONLINE] old

value: [OFFLINE]

crsctl

CRSD

@ Node 2

1

2CRSD

@ Node 1

CRSD

@ Node 0

3

agent

4 5

67

8




37

1.5.1.3 Log Correlation

CRSD is event-driven in nature. Everything of interest is an event/command to process. Two

kinds of commands are distinguished: planned and unplanned. The former are usually

administrator-initiated (add/start/stop/update a resource, etc.) or system-initiated

(resource auto start at node reboot, for instance) actions while the latter are normally

unsolicited state changes (a resource failure, for example). In either case, processing such

events/commands is what CRSD does and that’s when module interaction takes place. One

can easily follow the interaction/processing of each event in the logs, right from the point of

origination (say from the UI module) through to PE and then all the way to the agent and

back all the way using the concept referred to as a “tint”. A tint is basically a cluster-unique

event ID of the following format: {X:Y:Z}, where X is the node number, Y a node-unique

number of a process where the event first entered the system, and Z is a monotonically

increasing sequence number, per process. For instance, {1:25747:254} is a tint for the

254th event that originated in some process internally referred to us 25747 on node

number 1. Tints are new in 11.2.0.2 and can be seen in CRSD/OHASD/agent logs. Each event

in the system gets assigned a unique tint at the point of entering the system and modules

prefix each log message while working on the event with that tint.

For example, in a 3-node cluster where node0 is the PE, issuing a “crsctl start resource r1 –n

node2” from node1, exactly as illustrated on Figure 4 above, will produce the following in

the logs:

CRSD log node1 (crsctl always connects to the local CRSD; UI server forwards the

command to the PE):

2009-12-29 17:07:24.742: [UiServer][2689649568] {1:25747:256} Container [ Name:UI_START

…

RESOURCE:

TextMessage[r1]

2009-12-29 17:07:24.742: [UiServer][2689649568] {1:25747:256} Sending message to

PE. ctx= 0xa3819430

CRSD log node 0 (with PE master)

2009-12-29 17:07:24.745: [ CRSPE][2660580256] {1:25747:256} Cmd : 0xa7258ba8 :

flags: HOST_TAG | QUEUE_TAG

2009-12-29 17:07:24.745: [ CRSPE][2660580256] {1:25747:256} Processing PE

command id=347. Description: [Start Resource : 0xa7258ba8]

2009-12-29 17:07:24.748: [ CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new

target state: [ONLINE] old value: [OFFLINE]




38

2009-12-29 17:07:24.748: [ CRSOCR][2664782752] {1:25747:256} Multi Write Batch

processing...

2009-12-29 17:07:24.753: [ CRSPE][2660580256] {1:25747:256} Sending message to

agfw: id = 2198

Here, the PE performs a policy evaluation and interacts with the Proxy on the

destination node (to issue the start action) and the OCR (to record the new value for the

TARGET).

CRSD log node 2 (The proxy starts the agent, forwards the message to it)

2009-12-29 17:07:24.763: [ AGFW][2703780768] {1:25747:256} Agfw Proxy Server

received the message: RESOURCE_START[r1 1 1] ID 4098:2198

2009-12-29 17:07:24.767: [ AGFW][2703780768] {1:25747:256} Starting the agent:

/ade/agusev_bug/oracle/bin/scriptagent with user id: agusev and incarnation:1

AGENT log node 2 (the agent executes the start command)

2009-12-29 17:07:25.120: [ AGFW][2966404000] {1:25747:256} Agent received the

message: RESOURCE_START[r1 1 1] ID 4098:1459

2009-12-29 17:07:25.122: [ AGFW][2987383712] {1:25747:256} Executing command:

start for resource: r1 1 1

2009-12-29 17:07:26.990: [ AGFW][2987383712] {1:25747:256} Command: start for

resource: r1 1 1 completed with status: SUCCESS

2009-12-29 17:07:26.991: [ AGFW][2966404000] {1:25747:256} Agent sending reply

for: RESOURCE_START[r1 1 1] ID 4098:1459

CRSD log node 2 (The proxy gets a reply, forwards it back to the PE)


received the message: CMD_COMPLETED[Proxy] ID 20482:2212


replying to the message: CMD_COMPLETED[Proxy] ID 20482:2212

CRSD log node 0 (with PE master: receives the reply, notifies the Reporter and replies to UI

Server; the Reporter publishes to EVM)

2009-12-29 17:07:27.012: [ CRSPE][2660580256] {1:25747:256} Received reply to

action [Start] message ID: 2198

2009-12-29 17:07:27.504: [ CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new

external state [ONLINE] old value: [OFFLINE] on agusev_bug_2 label = []

2009-12-29 17:07:27.504: [ CRSRPT][2658479008] {1:25747:256} Sending UseEvm mesg

2009-12-29 17:07:27.513: [ CRSPE][2660580256] {1:25747:256} UI Command [Start

Resource : 0xa7258ba8] is replying to sender.




39

CRSD log node1 (where crsctl command was issued; UI server writes out the response,

completes the API call)

2009-12-29 17:07:27.525: [UiServer][2689649568] {1:25747:256} Container [ Name:

UI_DATA

r1:

TextMessage[0]

]

2009-12-29 17:07:27.526: [UiServer][2689649568] {1:25747:256} Done for

ctx=0xa3819430

The above demonstrates the ease of following distributed processing of a single request

across 4 processes on 3 nodes by using tints as a way to filter, extract, group and correlate

information pertaining to a single event across a plurality of diagnostic logs.




40

1.6 Grid Plug and Play (GPnP)

A new feature in Oracle Clusterware 11g release 2 (11.2) is Grid Plug and Play, which is

mainly managed by the Grid Plug and Play Daemon (GPnPD). The GPnPD provides access tothe GPnP profile, and coordinates updates to the profile among the nodes of the cluster to

ensure that all of the nodes have the most recent profile.

1.6.1 GPnP Configuration

The GPnP configuration is a profile and wallet configuration, identical for every peer node.

The profile and wallet are created and copied by the Oracle Universal Installer. The GPnP

profile is a XML test file which contains bootstrap information necessary to form a cluster.

Information such as the clustername, the GUID, the discovery strings, expected network

connectivity. It does not contain node specific. The profile is managed by GPnPD, and it

exists on every node in the GPnP cache. When there are no updates to the profile, it is

identical on all cluster nodes. The way the best profile is judged, is via a sequence number.

The GPnP wallet is just a binary blob containing public / private RSA keys, used to sign and

verify the GPnP profile. The wallet is identical for all GPnP peers and once created by the

Oracle Universal Installer; it never changes and lives forever.

A typical profile would contain the information below. Never change the XML file directly;

instead, use supported tools, like OUI, ASMCA, asmcd, oifcfg etc. in order to modify GPnP

profile information.

The use of gpnptool to make changes to the GPnP profile is discouraged as multiple steps

have to be executed to even get a modification into the profile. If the modification adds

invalid content, it will certainly mess up the profile information and subsequent errors will

happen.

# gpnptool get

Warning: some command line parameters were defaulted. Resulting command line:

/scratch/grid_home_11.2/bin/gpnptool.bin get -o-

<?xml version="1.0" encoding="UTF-8"?><gpnp:GPnP-Profile Version="1.0"

xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-

pnp.org/2005/11/gpnp-profile" xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-

profile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd"

ProfileSequence="4" ClusterUId="0cd26848cf4fdfdebfac2138791d6cf1"

ClusterName="stnsp0506" PALocation=""><gpnp:Network-Profile><gpnp:HostNetwork

id="gen" HostName="*"><gpnp:Network id="net1" IP="10.137.8.0" Adapter="eth0"

Use="public"/><gpnp:Network id="net2" IP="10.137.20.0" Adapter="eth2"

Use="cluster_interconnect"/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-




41

Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/><orcl:ASM-Profile

id="asm" DiscoveryString="/dev/sdf*,/dev/sdg*,/voting_disk/vote_node1"

SPFile="+DATA/stnsp0506/asmparameterfile/registry.253.699162981"/>

<ds:Signature

xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationM

ethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/><ds:SignatureMethod

Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/><ds:Reference

URI=""><ds:Transforms><ds:Transform

Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/><ds:Transform

Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"> <InclusiveNamespaces

xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl

xsi"/></ds:Transform></ds:Transforms><ds:DigestMethod

Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/><ds:DigestValue>ORAmrPMJ/plFtG

Tg/mZP0fU8ypM=</ds:DigestValue></ds:Reference></ds:SignedInfo><ds:SignatureValue>K

u7QBc1/fZ/RPT6BcHRaQ+sOwQswRfECwtA5SlQ2psCopVrO6XJV+BMJ1UG6sS3vuP7CrS8LXrOTyoIxSkU

7xWAIB2Okzo/Zh/sej5O03GAgOvt+2OsFWX0iZ1+2e6QkAABHEsqCZwRdI4za3KJeTkIOPliGPPEmLuImu

DiBgMk=</ds:SignatureValue></ds:Signature></gpnp:GPnP-Profile>

Success.

The initial GPnP configuration is created and propagated by the root script as part of the

Oracle Clusterware installation. During a fresh install the profile content is sourced from the

Oracle Universal Installer interview results in Grid_home/crs/install/crsconfig_params.

1.6.2 GPnP Daemon

The GPnP daemon is like all other daemons OHASD managed and spawned by OHASD

oraagent. The main purpose of the GPnPD is to serve the profiles, therefore it must run in

order for the stack to start. The GPnPD startup sequence is mainly:

– detects running gpnpd, connects back to oraagent

– opens wallet/profile

– opens local/remote endpoints

– advertises remote endpoint with mdnsd

– starts OCR availability check

– discovers remote gpnpds

– equalizes profile

– starts to service clients




42

1.6.3 GPnP CLI Tools

There are a few client tools which indirectly perform GPnP profile changes. They require

ocssd to be running:

– crsctl replace discoverystring

– oifcfg getif / setif

– ASM – srvctl or sqlplus changing the spfile location or the ASM disk discoverystring

Note, that profile changes are serialized cluster-wide with a CSS lock (bug 7327595).

Grid_home/bin/gpnptool is the actual tool to manipulate the gpnp profile. To see the

detailed usage, run ‘gpnptool help’.

Oracle GPnP Tool

Usage:

"gpnptool <verb> <switches>", where verbs are:

create Create a new GPnP Profile

edit Edit existing GPnP Profile

getpval Get value(s) from GPnP Profile

get Get profile in effect on local node

rget Get profile in effect on remote GPnP node

put Put profile as a current best

find Find all RD-discoverable resources of given type

lfind Find local gpnpd server

check Perform basic profile sanity checks

c14n Canonicalize, format profile text (XML C14N)

sign Sign/re-sign profile with wallet's private key

unsign Remove profile signature, if any

verify Verify profile signature against wallet certificate

help Print detailed tool help

ver Show tool version

1.6.4 Debugging and Troubleshooting

In order to get more log and trace information there is a tracing environment variable

GPNP_TRACELEVEL which range is from [0-6]. The GPnP traces are located mainly at

Grid_home/log/<hostname>/alert*,

Grid_home/log/<hostname>/client/gpnptool*, other client logs

Grid_home/log/<hostname>/gpnpd|mdnsd/*

Grid_home/log/<hostname>/agent/ohasd/oraagent_<username>/*

The product setup files which are holding the initial information are located at




43

Grid_home/crs/install/crsconfig_params

Grid_home/cfgtoollogs/crsconfig/root*

Grid_home/gpnp/*, Grid_home /gpnp/<hostname>/* [profile+wallet]

If the GPnP setup is failing the following failure scenario checks should be performed.

– Failed to create wallet, profile? Failed to sign profile? Wrong signature? No access

to wallet or profile? [gpnpd is dead, stack is dead] (bug:8609709,bug:8445816)

– Missing/bad settings in profile (e.g. no discovery string, no interconnect, too many

interconnects)? [gpnpd is up, stack is dead – e.g. no voting files, no interconnects]

– Failed to propagate cluster-wide config? [gpnpd daemons are not communicating,

no put]

If something is failing during the GPnP runtime the following checks should be done.

– Is mdnsd running? Gpnpd failed to register with mdnsd? Discovery fails? [no put,

rget]

– Is gpnpd dead/not running? [no get, immediately fails]

– Is gpnpd is not fully up? [no get, no put, client spins in retries, times out]

– Discovering spurious nodes as a part of the cluster? [no put, can block gpnpd

dispatch]

– Is ocssd is not up? [no put]

– OCR was up, but failed [gpnpd dispatch can block, client waits in receive until OCR

recovers]

For all the above a first source would be the appropriate daemon log files and check the

resources status via crsctl stat res –init –t




44

Other troubleshooting steps if GPnPD is not running are:

– Check if the GPnP configuration is valid and check the GPnP log files for errors.

Some sanity checks can be done with gpnptool check or gpnptool verify

# gpnptool check -\

p=/scratch/grid_home_11.2/gpnp/stnsp006/profiles/peer/profile.xml

Profile cluster="stnsp0506", version=4

GPnP profile signed by peer, signature valid.

Got GPnP Service current profile to check against.

Current GPnP Service Profile cluster="stnsp0506", version=4

Error: profile version 4 is older than- or duplicate of- GPnP Service

current profile version 4.

Profile appears valid, but push will not succeed.

# gpnptool verify

Oracle GPnP Tool

verify Verify profile signature against wallet certificate

Usage:

"gpnptool verify <switches>", where switches are:

-p[=profile.xml] GPnP profile name

-w[=file:./] WRL-locator of OracleWallet with crypto

keys

-wp=<val> OracleWallet password, optional

-wu[=owner] Wallet certificate user (enum:

owner,peer,pa)-t[=3] Trace level (min..max=0..7), optional

-f=<val> Command file name, optional

-? Print verb help and exit

– Is gpnpd serving locally, this can be checked with gpnptool lfind

# gpnptool lfind

Success. Local gpnpd found.

‘gpnptool get’ should return the local profile information. If gpnptool lfind|get

hangs, a pstack from the hanging client and the GPnPD log files under

Grid_home/log/<hostname>/gpnpd would be beneficial for further debugging.




45

– To check if the remote GPnPD daemon is responding, the ‘find’ option is very

helpful:

# gpnptool find -h=stnsp006

Found 1 instances of service 'gpnp'.

mdns:service:gpnp._tcp.local.://stnsp006:17452/agent=gpnpd,cname=stnsp0506

,host=stnsp006,pid=13133/gpnpd h:stnsp006 c:stnsp0506

If the above is hanging or returns with an error, check the

Grid_home/log/<hostname>/mdnsd/*.log files and the gpnpd logs.

– To check if all the peers are responding, run gpnptool find –c=<clustername>

# gpnptool find -c=stnsp0506

Found 2 instances of service 'gpnp'.





We store copies of the GPnP profile in the local OLR and the OCR. In case of loss or

corruption, GPnPD pulls the information from there and recreates the profile.




46

1.7 Oracle Grid Naming Service (GNS):

GNS performs name resolution in the cluster. GNS doesn't always use mDNS for

performance reasons.

In Oracle Clusterware 11g release 2 (11.2) we support the use of DHCP for both the private

interconnect and for almost all virtual IP addresses on the public network. For clients outside

the cluster to find the virtual hosts in the cluster, we provide a Grid Naming Service (GNS).

This works with any higher-level DNS to provide resolvable names to external clients.

This section explains how to perform a simple setup of DHCP and GNS. A complex network

environment may require a more elaborate solution. The GNS and DHCP setup must be in

place before the grid infrastructure installation.

1.7.1 What Grid Naming Service Provides

DHCP provides dynamic configuration of the hosts IP address, but does not provide a goodway to produce names that are useful to external clients. As a result, it has been uncommon

in server complexes. In Oracle Clusterware 11g release 2 (11.2), this problem is solved by

providing our own service for resolving names in the cluster, and connecting this to the DNS

that is visible to the clients.

1.7.2 Network Configuration Steps

To get GNS to work for clients, it is necessary to configure the higher-level DNS to “delegate”

a subdomain to the cluster, and the cluster must run GNS on an address known to the DNS.

The GNS address will be maintained as a statically configured VIP in the cluster. The GNS

daemon (GNSD) will follow that VIP around the cluster and service names in the subdomain.

Four things need to be configured:

– A single static address in the public network for the cluster to use as the GNS VIP.

– Delegation from the higher-level DNS for names within the cluster sub-domain to

the GNS VIP.

– a DHCP for dynamic address provision on the public network

– a running cluster with properly configured GNS

1.7.2.1 Obtain an IP address for the GNS-VIP

Request an IP address from your network administrator to be assigned as the GNS-VIP. This

IP address is to be registered with the corporate DNS as the GNS-VIP for a given cluster, for

example strdv0108-gns.mycorp.com . Do not plumb this IP address; it will be

managed by clusterware after installing it.

1.7.2.2 Establish DNS delegation for the GNS sub-domain to the GNS-VIP

Create an entry of the following format in the appropriate DNS zone file:

# Delegate to gns on strdv0108




47

strdv0108-gns.mycorp.com NS strdv0108.mycorp.com

#Let the world know to go to the GNS vip

strdv0108.mycorp.com 10.9.8.7

Here, the sub-domain is strdv0108.mycorp.com, the GNS VIP has been assigned the

name strdv0108-gns.us.mycorp.com (corresponding to a chosen static IP address),

and the GNS daemon will listen on the default port 53.

NOTE: This does not establish an address for the name strdv0108.mycorp.com – it

creates a way of resolving a name within this sub-domain, such as clusterNode1-

VIP.strdv0108.mycorp.com.

1.7.3 DHCP

With DHCP, a host requiring an IP address sends a broadcast message to the hardware

network. A DHCP server on the segment can respond to the request, and give back an

address, along with other information such as what gateway to use, what DNS server(s) to

use, what domain should be used, what NTP server should be used, etc.

When we get DHCP for the public network, we have several IP addresses:

– One IP address per host (the node VIP)

– Three IP addresses per cluster for the cluster-wide SCAN.

The GNS VIP can’t be obtained from DHCP, because it must be known in advance, so must

be statically assigned.

The DHCP configuration file is /etc/dhcp.conf.

Using the following configuration example:

– the interface on the subnet is 10.228.212.0/10 (netmask 255.255.252.0)

– the addresses allowed to be served are 10.228.212.10 through 10.228.215.254

– the gateway is 10.228.212.1

– the domain the machines will reside in for DNS purposes is strdv0108.mycorp.com

/etc/dhcp.conf would contain something similar to:




48

subnet 10.228.212.0 netmask 255.255.252.0

{

default-lease-time 43200;

max-lease-time 86400;

option subnet-mask 255.255.252.0;

option broadcast-address 10.228.215.255;

option routers 10.228.212.1;

option domain-name-servers M.N.P.Q, W.X.Y.Z;

option domain-name "strdv0108.mycorp.com";

pool

{

range 10.228.212.10 10.228.215.254;

}

}

1.7.3.1 Name resolution

The /etc/resolv.conf must contain nameserver entries that are resolvable to corporate DNS

servers, and the total timeout period configured (a combination of options attempts

[retries] and options timeout [exponential backoff]) should be less than 30 seconds. For

example:

/etc/resolv.conf:

options attempts: 2

options timeout: 1

search us.mycorp.com mycorp.com

nameserver 130.32.234.42

nameserver 133.2.2.15

The /etc/nsswitch.conf controls name service lookup order. In some system configurations,

the Network Information System (NIS) can cause problems with Oracle SCAN address

resolution. It is suggested to place the NIS entry at the end of the search list.

/etc/nsswitch.conf

hosts: files dns nis

See Also: Oracle Grid Infrastructure Installation Guide,

"DNS Configuration for Domain Delegation to Grid Naming Service" for more information.

In Oracle Clusterware 11g release 2 (11.2) GNS is managed by a Clusterware agent

(orarootagent). The agent will start, stop and check the GNS. The SCAN agent advertises its

name and address with GNS and each SCAN VIP registers itself as well. All this is done during

the Oracle Universal Installer installation. The information about GNS is added to the OCR

and the GNS is added to the cluster through the srvctl add gns –d <mycluster.company.com>

command.




49

1.7.4 The GNS Server

During a server startup the GNS server retrieves the name from the subdomain to be

serviced from the OCR and starts the threads. The first thing the GNS server will do is a self

check once all the threads are running. It performs a test to see if the name resolution is

working. The client API is called to register a dummy name and address and the server then

attempts to resolve the name. If the resolution succeeds and one of the addresses matches

the dummy address, the self check has succeeded and a message is written to the cluster

alert<hostname>.log. This self check is done only once and even if the test is failing GNS

server keeps running.

The default trace location for GNS server is Grid_home/log/<hostname>/gnsd/. The trace file

format looks like the following:

<Time stamp>: [GNS][Thread ID]<Thread name>::<function>:<message>

2009-09-21 10:33:14.344: [GNS][3045873888] Resolve::clsgnmxInitialize:

initializing mutex 0x86a7770 (SLTS 0x86a777c).

1.7.5 The GNS Agent

The GNS Agent (orarootagent) will check the GNS server periodically. The check is done by

querying the GNS for its status.

To see if the agent is successfully advertising with GNS, run:

#grep -i 'updat.*gns'

Grid_home/log/<hostname>/agent/crsd/orarootagent_root/orarootagent_*

orarootagent_root.log:2009-10-07 10:17:23.513: [ora.gns.vip] [check] Updating GNS

with stnsp0506-gns-vip 10.137.13.245

orarootagent_root.log:2009-10-07 10:17:23.540: [ora.scan1.vip] [check] Updating

GNS with stnsp0506-scan1-vip 10.137.12.200





orarootagent_root.log:2009-10-07 10:17:23.597: [ora.stnsp005.vip] [check] Updating

GNS with stnsp005-vip 10.137.12.228

orarootagent_root.log:2009-10-07 10:17:23.615: [ora.stnsp006.vip] [check] Updating

GNS with stnsp006-vip 10.137.12.226




50

1.7.6 Command Line Interface

There command line interface to interact with GNS is via srvctl (the only supported way).

The crsctl can stop and start the ora.gns but we don’t support this other than told by

development directly.

GNS operations are run by performing operations on the “gns” noun so like:

# srvctl {start|stop|modify|etc.} gns ...

To start gns:

# srvctl start gns [-l <log_level>] - where –l is the level of logging that GNS

should run with.

To stop gns:

# srvctl stop gns

To advertise a name and address:

# srvctl modify gns -N <name> -A <address>

1.7.7 Debugging GNS

The default GNS server logging level is 0, which can be seen via a simple ps –ef | grep

gnsd.bin.

/scratch/grid_home_11.2/bin/gnsd.bin -trace-level 0 -ip-address 10.137.13.245 -

startup-endpoint ipc://GNS_stnsp005_31802_429f8c0476f4e1

To debug GNS server issues it is sometimes necessary to increase this log level. Which can be

done by stopping the GNS server via srvctl stop gns and restart it via srvctl start gns –v –l 5. Only the root user can stop and start the GNS.

Usage: srvctl start gns [-v] [-l <log_level>] [-n <node_name>]

-v Verbose output

-l <log_level> Specify the level of logging that GNS should run

with.

-n <node_name> Node name

-h Print usage

The trace level ranges from 0 to 6; level 5 should be sufficient in all the cases; setting the

trace level to level 6 is not recommended as gnsd will consume a lot of CPU.




51

Due to bug 8705125 in 11.2.0.1, the default logging level for GNS server (gnsd daemon) will

be level 6 after the initial installation. To set the log level back to the default value of 0, stop

and start the GNS using ‘srvctl stop / start’. This will only stop and start the gnsd.bin, and

will not cause any harm on the running cluster.

– srvctl stop gns

– srvctl start gns –l 0

To list the current GNS configuration srvctl should be used like:

srvctl config gns –a

GNS is enabled.

GNS is listening for DNS server requests on port 53

GNS is using port 5353 to connect to mDNS

GNS status: OK

Domain served by GNS: stnsp0506.oraclecorp.com

GNS version: 11.2.0.1.0

GNS VIP network: ora.net1.network

Starting with 11.2.0.2 the –l option (List all records in GNS) is a very helpful option in order

to debug GNS issues.




52

1.8 Grid Interprocess Communication

Grid Interprocess Communication (GIPC) is a new common communications infrastructure to

replace CLSC/NS. It provides a full control of the communications stack from the operatingsystem up to whatever client library uses it. The dependency on network services (NS) prior

to 11.2 is removed, but there is still backwards compatibility with existing CLSC clients

(mainly from 11.1).

GIPC can support multiple communications types: CLSC, TCP, UDP, IPC and the

communication type GIPC.

The configuration regarding listening endpoints with GIPC is a little different. The

private/cluster interconnects are now defined in the GPnP profile.

The requirement for the same interfaces to exist with the same name on all nodes is more

relaxed, as long as communication will be established. The part of the GPnP profile

regarding the private and public network configuration is:

<gpnp:Network id="net1" IP="10.137.8.0" Adapter="eth0" Use="public"/><gpnp:Network

id="net2" IP="10.137.20.0" Adapter="eth2" Use="cluster_interconnect"/>

1.8.1 Logs and Diagnostics

The GIPC default trace level only prints errors, and the default trace level for the different

components ranges from 0 to 2. To debug GIPC related issues, it might be necessary to

increase the trace levels, which are described below.

1.8.2 Setting trace levels via crsctl

With crsctl it is possible to set a GIPC trace level for different components.

Example:

# crsctl set log css COMMCRS:abcd

Where

• a denotes the trace level for NM

• b denotes the trace level for GM

• c denotes the trace level for GIPC

• d denotes the trace level for PROC




53

If the component of interest is GIPC, and want to modify the GIPC trace level only, up from

its default value of 2, simply run:

# crsctl set log css COMMCRS:2242

To turn on GIPC tracing for all components (NM, GM, etc.), set

# crsctl set log css COMMCRS:3 or

# crsctl set log css COMMCRS:4

With level 4, a lot of tracing is generated, so the ocssd.log will wrap around fairly quickly.

1.8.3 Setting trace levels via GIPC_TRACE_LEVEL and GIPC_FIELD_LEVEL

Another option is to set a pair of environment variables for the component using GIPC as

communication e.g. ocssd. In order to achieve this, a wrapper script is required. Taking

ocssd as an example, the wrapper script is Grid_home/bin/ocssd that invokes ‘ocssd.bin’.

Adding the variables below to the wrapper script (under the LD_LIBRARY_PATH) and

restarting ocssd will enable GIPC tracing. To restart ocssd.bin, perform a crsctl stop/start

cluster.

case `/bin/uname` in

Linux)

LD_LIBRARY_PATH=/scratch/grid_home_11.2/lib

export LD_LIBRARY_PATH

export GIPC_TRACE_LEVEL=4

export GIPC_FIELD_LEVEL=0x80

# forcibly eliminate LD_ASSUME_KERNEL to ensure NPTL where

available

LD_ASSUME_KERNEL=

export LD_ASSUME_KERNEL

LOGGER="/usr/bin/logger"

if [ ! -f "$LOGGER" ];then

LOGGER="/bin/logger"

fi

LOGMSG="$LOGGER -puser.err"

;;

This will set the trace level to 4. The values for the trace environment variables are




54

GIPC_TRACE_LEVEL=3 (valid range [0-6])

GIPC_FIELD_LEVEL=0x80 (only 0x80 is supported)

1.8.4 Setting trace levels via GIPC_COMPONENT_TRACE

To enable more fine grained tracing use the following environment variable

GIPC_COMPONENT_TRACE. The defined components are

GIPCGEN, GIPCTRAC, GIPCWAIT, GIPCXCPT, GIPCOSD, GIPCBASE, GIPCCLSA, GIPCCLSC,

GIPCEXMP, GIPCGMOD, GIPCHEAD, GIPCMUX, GIPCNET, GIPCNULL, GIPCPKT, GIPCSMEM,

GIPCHAUP, GIPCHALO, GIPCHTHR, GIPCHGEN, GIPCHLCK, GIPCHDEM, GIPCHWRK

Example:

# export GIPC_COMPONENT_TRACE=GIPCWAIT:4,GIPCNET:3

How does a trace message look like?

2009-10-23 05:47:40.952: [GIPCMUX][2993683344]gipcmodMuxCompleteSend: [mux]

Completed send req 0xa481c0e0 [00000000000093a6] { gipcSendRequest : addr '', data

0xa481c830, len 104, olen 104, parentEndp 0x8f99118, ret gipcretSuccess (0),

objFlags 0x0, reqFlags 0x2 }

2009-10-23 05:47:40.952: [GIPCWAIT][2993683344]gipcRequestSaveInfo: [req]

Completed req 0xa481c0e0 [00000000000093a6] { gipcSendRequest : addr '', data

0xa481c830, len 104, olen 104, parentEndp 0x8f99118, ret gipcretSuccess (0),

objFlags 0x0, reqFlags 0x4 }

Only some layers like CSS (client and server), GPNPD, GNSD, and small parts of MDNSD areusing GIPC right now.

Others like CRS/EVM/OCR/CTSS will use GIPC starting with 11.2.0.2. This is important to

know in order to turn on GIPC tracing or the old NS/CLSC tracing to debug communication

issues.




55

1.9 Cluster time synchronization service daemon (CTSS):

The CTSS is a new feature in Oracle Clusterware 11g release 2 (11.2), which takes care of

time synchronization in a cluster, in case the network time protocol daemon is not runningor is not configured properly.

The CTSS synchronizes the time on all of the nodes in a cluster to match the time setting on

the CTSS master node. When Oracle Clusterware is installed, the Cluster Time

Synchronization Service (CTSS) is installed as part of the software package. During

installation, the Cluster Verification Utility (CVU) determines if the network time protocol

(NTP) is in use on any nodes in the cluster. On Windows systems, CVU checks for NTP and

Windows Time Service.

If Oracle Clusterware finds that NTP is running or that NTP has been configured, then NTP is

not affected by the CTSS installation. Instead, CTSS starts in observer mode (this condition is

logged in the alert log for Oracle Clusterware). CTSS then monitors the cluster time and logsalert messages, if necessary, but CTSS does not modify the system time. If Oracle

Clusterware detects that NTP is not running and is not configured, then CTSS designates one

node as a clock reference, and synchronizes all of the other cluster member time and date

settings to those of the clock reference.

Oracle Clusterware considers an NTP installation to be misconfigured if one of the following

is true:

– NTP is not installed on all nodes of the cluster; CVU detects an NTP installation by a

configuration file, such as ntp.conf

– The primary and alternate clock references are different for all of the nodes of the

cluster

– The NTP processes are not running on all of the nodes of the cluster; only one type

of time synchronization service can be active on the cluster.

To check whether CTSS is running in active or observer mode run crsctl check ctss

CRS-4700: The Cluster Time Synchronization Service is in Observer mode.

or

CRS-4701: The Cluster Time Synchronization Service is in Active mode.

CRS-4702: Offset from the reference node (in msec): 100

The tracing for the ctssd daemon is written to the octssd.log. The alert log

(alert<hostname>.log) also contains information about the mode in which CTSS is running.

[ctssd(13936)]CRS-2403:The Cluster Time Synchronization Service on host node1 is

in observer mode.

[ctssd(13936)]CRS-2407:The new Cluster Time Synchronization Service reference node




56

is host node1.

[ctssd(13936)]CRS-2401:The Cluster Time Synchronization Service started on host

node1.

1.9.1 CVU checks

There are pre-install CVU checks performed automatically during installation, like: cluvfy

stage –pre crsinit <>

This step will check and make sure that the operating system time synchronization software

(e.g. NTP) is either properly configured and running on all cluster nodes, or on none of the

nodes.

During the post-install check, CVU will run cluvfy comp clocksync –n all. If CTSS is in observer

mode, it will perform a configuration check as above. If the CTSS is in active mode, we verify

that the time difference is within the limit.

1.9.2 CTSS resource

When CTSS comes up as part of the clusterware startup, it performs step time sync, and if

everything goes well, it publishes its state as ONLINE. There is a start dependency on

ora.cssd but note that it has no stop dependency, so if for some reasons (maybe faulted

CTSSD), CTSSD dumps core or exits, nothing else should be affected.

The chart below shows the start dependency build on ora.ctssd for other resources.

Figure 5: ora.ctssd start dependency picture.




57

crsctl stat res ora.ctssd -init –t

----------------------------------------------------------------------

NAME TARGET STATE SERVER STATE_DETAILS

----------------------------------------------------------------------

ora.ctssd

1 ONLINE ONLINE node1 OBSERVER

1.10 mdnsd

1.10.1 Debugging mdnsd

In order to capture mdnsd network traffic, use the mDNS Network Monitor located in

Grid_home/bin:

# mkdir Grid_home/log/$HOSTNAME/netmon

# Grid_home/bin/oranetmonitor &

The output from oranetmonitor will be captured in netmonOUT.log in the above directory.




58

2 Voting Files and Oracle Cluster Repository Architecture

Storing OCR and the voting files in ASM eliminates the need for third-party cluster volume

managers and eliminates the complexity of managing disk partitions for OCR and voting files

in Oracle Clusterware installations.

2.1 Voting File in ASM

ASM manages voting files differently from other files that it stores. When voting files are

placed on disks in an ASM disk group, Oracle Clusterware records exactly on which disks in

that diskgroup they are located. If ASM fails, then CSS can still access the voting files. If you

choose to store voting files in ASM, then all voting files must reside in ASM, i.e. we do not

support mixed configurations like storing some voting files in ASM and some on NAS.

The number of voting files you can store in a particular Oracle ASM disk group depends upon

the redundancy of the disk group.

– External redundancy: A disk group with external redundancy can store only one

voting file

– Normal redundancy: A disk group with normal redundancy can store up to three

voting files

– High redundancy: A disk group with high redundancy can store up to five voting files

By default, Oracle ASM puts each voting file in its own failure group within the disk group. A

failure group is a subset of the disks in a disk group, which could fail at the same time

because they share hardware, e.g. a disk controller. The failure of common hardware must

be tolerated. For example, four drives that are in a single removable tray of a large JBOD

(Just a Bunch of Disks) array are in the same failure group because the tray could beremoved, making all four drives fail at the same time. Conversely, drives in the same cabinet

can be in multiple failure groups if the cabinet has redundant power and cooling so that it is

not necessary to protect against failure of the entire cabinet. However, Oracle ASM

mirroring is not intended to protect against a fire in the computer room that destroys the

entire cabinet. If voting files stored on Oracle ASM with Normal or High redundancy, and the

storage hardware in one failure group suffers a failure, then if there is another disk available

in a disk group in an unaffected failure group, Oracle ASM recovers the voting file in the

unaffected failure group.

2.2 Voting File Changes

–

The voting files formation critical data are stored in the voting file and not in theOCR anymore. From a voting file perspective, the OCR is not touched at all. The

critical data each node must agree on to form a cluster are e.g. misscount and the

list of voting files configured.




59

– In Oracle Clusterware 11g release 2 (11.2), it is no longer necessary to back up the

voting disk. The voting disk data is automatically backed up in OCR as part of any

configuration change and is automatically restored to any voting disk that is being

added. If all voting disks are corrupted, however, you can restore them as described

in the Oracle Clusterware Administration and Deployment Guide.

– New blocks added to the voting files are the voting file identifier block (needed for

voting file stored in ASM), and it contains the cluster GUID and the file UID. The

committed and pending configuration incarnation number (CCIN and PCIN) contain

this formation critical data.

– To query the configured voting files and to see their location run crsctl query css

votedisk :

$ crsctl query css votedisk

## STATE File Universal Id File Name Disk group

-- ----- ----------------- --------- ----------

1. ONLINE 3e1836343f534f51bf2a19dff275da59 (/dev/sdf10) [DATA]

2. ONLINE 138cbee15b394f3ebf57dbfee7cec633 (/dev/sdg11) [DATA]

3. ONLINE 462722bd24c94f70bf4d90539c42ad4c (/dev/sdu12) [DATA]

Located 3 voting file(s).

– Voting files that reside in ASM

o Voting files that reside in ASM may be automatically deleted and added

back if one of the existing voting files gets corrupted.

– Voting files can be migrated from/to NAS/ASM and from ASM to ASM with e.g

$ crsctl replace css votedisk /nas/vdfile1 /nas/vdfile2 /nas/vdfile3

or

$ crsctl replace css votedisk +OTHERDG

– If all voting files are corrupted, however, you can restore them as described below.

If the cluster is down and cannot restart due to lost voting files, then you must start

CSS in exclusive mode to replace the voting files by entering the following

command:

o # crsctl start crs –excl (on one node only)

o # crsctl delete css votedisk FUID

o # crsctl add css votedisk path_to_voting_disk

– In case of extended Oracle Clusterware / extended RAC configuration, the third

voting file must be located on a third storage on a third side to prevent you from a

data center outage. We do support a third voting on standard NFS. For more




60

information see Appendix “Oracle Clusterware 11g release 2 (11.2) - Using standard

NFS to support a third voting file on a stretch cluster configuration”.

See Also: Oracle Clusterware Administration and Deployment Guide, "Voting file, OracleCluster Registry, and Oracle Local Registry" for more information. For information about

extended clusters and how to configure the quorum voting file see the Appendix.

2.3 Oracle Cluster Registry (OCR)

As of 11.2, OCR can also be stored in ASM. The ASM partnership and status table (PST) is

replicated on multiple disks and is extended to store OCR. Consequently, OCR can tolerate

the loss of the same number of disks as are in the underlying disk group, and be can

relocated / rebalanced in response to disk failures.

In order to store an OCR on a disk group, the disk group has a ‘special’ file type called ‘ocr’.

The default configuration location is /etc/oracle/ocr.loc

# cat /etc/oracle/ocr.loc

ocrconfig_loc=+DATA

local_only=FALSE

From a user and maintenance perspective, the rest remains the same. The OCR can only be

configured in ASM when the cluster completely migrated to 11.2 (crsctl query crs

activeversion >= 11.2.0.1.0). We still support mixed configurations, so we could have OCR’s

stored in ASM and another stored on a supported NAS device, as we support up to 5 OCR

locations in 11.2.0.1. We do not support raw or block devices for neither OCR nor voting files

anymore.

The OCR diskgroup is auto mounted by the ASM instance during startup. The CRSD and ASMdependency is maintained by OHASD.

OCRCHECK

There are small enhancements in ocrcheck like the –config which is only checking the

configuration. Run ocrcheck as root otherwise the logical corruption check will not run. To

check OLR data use the –local keyword.

Usage: ocrcheck [-config] [-local]

Shows OCR version, total, used and available space

Performs OCR block integrity (header and checksum) checks

Performs OCR logical corruption checks (11.1.0.7)

‘-config’ checks just configuration (11.2)

‘-local’ checks OLR, default OCR

Can be run when stack is up or down




61

The output is similar like below:

# ocrcheck

Status of Oracle Cluster Registry is as follows:

Version : 3

Total space (kbytes) : 262120

Used space (kbytes) : 3072

Available space (kbytes) : 259048

ID : 701301903

Device/File Name : +DATA

Device/File integrity check succeeded

Device/File Name : /nas/cluster3/ocr3








Cluster registry integrity check succeeded

Logical corruption check succeeded

2.4 Oracle Local Registry (OLR)

The OLR, similar in structure as the OCR, is a node-local repository, and is managed by

OHASD. The configuration data in OLR pertains to the local node only, and is not shared

among other nodes.

The configuration is stored in ‘/etc/oracle/olr.loc’ (on Linux) or equivalent on other OS. The

default location after installing Oracle Clusterware is:

– RAC: Grid_home/cdata/<hostname.olr>

– Oracle Restart: Grid_home/cdata/localhost/hostname.

The information stored in the OLR is needed by OHASD to start or join a cluster; this includes

data about GPnP wallets, clusterware configuration and version information.

OLR keys have the same properties as OCR keys and the same tools are used to either check

or dump them.

To see the OLR location, run the command:




62

# ocrcheck -local –config

Oracle Local Registry configuration is :

Device/File Name : Grid_home/cdata/node1.olr

To dump the OLR content, run the command:

# ocrdump -local –stdout (or filename)

ocrdump –h to get the usage

See Also: Oracle Clusterware Administration and Deployment Guide, "Managing the Oracle

Cluster Registry and Oracle Local Registries" for more information about using the ocrconfig

and ocrcheck.

2.5 Bootstrap and Shutdown if OCR is located in ASM

ASM has to be up with the diskgroup mounted before any OCR operations can beperformed. There are bugs reported when the diskgroup having OCR was dismounted force

and/or ASM instance was shutdown abort.

When the stack is running, CRSD keeps reading/writing OCR.

OHASD maintains the resource dependency and will bring up ASM with the required

diskgroup mounted before it starts CRSD.

Once ASM is up with the diskgroup mounted, the usual ocr* commands (ocrcheck,

ocrconfig, etc.) can be used.

The shutdown command will fail with an ORA-15097 for the ASM instance with an active

OCR (meaning that crsd is running on this node) in it. In order to see which clients areaccessing ASM, use the commands

asmcmd lsct (v$asm_client)

DB_Name Status Software_Version Compatible_version Instance_Name

Disk_Group

+ASM CONNECTED 11.2.0.1.0 11.2.0.1.0 +ASM2

DATA

asmcmd lsof

DB_Name Instance_Name Path

+ASM +ASM2 +data.255.4294967295

Where +data.255 is the OCR file number which is used to identify the OCR file within ASM.

2.6 OCR in ASM diagnostics

If any error occurs,




63

– Ensure that the ASM instance is up and running with the required diskgroup

mounted, and/or check ASM alert.log for the status for the ASM instance.

– Verify that the OCR files were properly created in the diskgroup, using asmcmd ls. Since the clusterware stack keeps accessing OCR files, most of the time the error

will show up as a CRSD error in the crsd.log. Any error related to an ocr* command

(like crsd, also considered an ASM client) will generate a trace file in the

Grid_home/log/<hostname>/client directory; in either case, look for kgfo / kgfp /

kgfn at the top of the error stack.

– Confirm that the ASM compatible.asm property of the diskgroup is set to at least

11.2.0.0.

2.7 The ASM Diskgroup Resource

When the diskgroup is created, the diskgroup resource is automatically created with the

name, ora.<DGNAME>.dg and the status is set to ONLINE. The status OFFLINE will be set if

the diskgroup is dismounted, as this is a CRS managed resource now. When the diskgroup is

dropped the diskgroup resource is removed as well.

A dependency between the database and the diskgroup is automatically created when the

database tries to access the ASM files. However, when the database does not longer uses

the ASM files or the ASM files are removed, we do not remove the database dependency

automatically. This must be done using the srvctl command line tool.

Typical ASM alert.log messages for success/failure and warnings are

Success:

NOTE: diskgroup resource ora.DATA.dg is offline

NOTE: diskgroup resource ora.DATA.dg is online

Failure

ERROR: failed to online diskgroup resource ora.DATA.dg

ERROR: failed to offline diskgroup resource ora.DATA.dg

Warning

WARNING: failed to online diskgroup resource ora.DATA.dg (unable to

communicate with CRSD/OHASD)

This warning may appear when the stack is started

WARNING: unknown state for diskgroup resource ora.DATA.dg

If errors happen, look at the ASM alert.log for the related resource operation status message

like,




64

“ERROR”: the resource operation failed; check CRSD log and Agent log for more

details

Grid_home/log/<hostname>/crsd/

Grid_home/log/<hostname>/agent/crsd/oraagent_user/

“WARNING”: cannot communicate with CRSD.

This warning can be ignored during bootstrap as ASM instance starts up and mount the

diskgroup before CRSD.

The status of the diskgroup resource and the diskgroup should be consistent. In rare cases,

they may become out of sync transiently. To get them back in sync manually run srvctl to

sync the status, or wait some time for the agent to refresh the status. If they become out of

sync for a long period, please check CRSD log and ASM log for more details.

To turn on more comprehensive tracing use event="39505 trace name context forever, level1“.

2.8 The Quorum Failure Group

A quorum failure group is a special type of failure group and disks in these failure groups do

not contain user data and are not considered when determining redundancy requirements.

The COMPATIBLE.ASM disk group compatibility attribute must be set to 11.2 or greater to

store OCR or voting file data in a disk group.

During Oracle Clusterware installation we do not offer to create a quorum failure group

which is needed for a third voting files in case of extended / stretched clusters or two

storage arrays.

Create a diskgroup with a failgroup and optionally a quorum failgroup if a third array is

available.

SQL> CREATE DISKGROUP PROD NORMAL REDUNDANCY

FAILGROUP fg1 DISK ‘<a disk in SAN1>’

FAILGROUP fg2 DISK ‘<a disk in SAN2>’

QUORUM FAILGROUP fg3 DISK ‘<another disk or file on a third location>’

ATTRIBUTE ‘compatible.asm’ = ’11.2.0.0’;

If the diskgroup creation was done using ASMCA, then after adding a quorum disk to the disk

group, Oracle Clusterware will automatically change the CSS votedisk location to something

like below:



-- ----- ----------------- --------- ---------

1. ONLINE 3e1836343f534f51bf2a19dff275da59 (/dev/sdg10) [DATA]




65

2. ONLINE 138cbee15b394f3ebf57dbfee7cec633 (/dev/sdf11) [DATA]

3. ONLINE 462722bd24c94f70bf4d90539c42ad4c (/voting_disk/vote_node1)

[DATA]


If it is done via SQL*PLUS the crsctl replace css votedisk must be used.

See Also: Oracle Database Storage Administrator's Guide, "Oracle ASM Failure Groups" for

more information. Oracle Clusterware Administration and Deployment Guide, "Voting file,

Oracle Cluster Registry, and Oracle Local Registry" for more information about backup and

restore and failure recovery.

2.9 ASM spfile

2.9.1 ASM spfile location

Oracle recommends that the Oracle ASM SPFILE is placed in a disk group. You cannot use anew alias created on an existing Oracle ASM SPFILE to start up the Oracle ASM instance.

If you do not use a shared Oracle grid infrastructure home, then the Oracle ASM instance

can use a PFILE. The same rules for file name, default location, and search order that apply

to database initialization parameter files also apply to Oracle ASM initialization parameter

files.

When an Oracle ASM instance searches for an initialization parameter file, the search order

is:

– The location of the initialization parameter file specified in the Grid Plug and Play

(GPnP) profile

– If the location has not been set in the GPnP profile, the search order changes to:

o SPFILE in the Oracle ASM instance home

For example, the SPFILE for Oracle ASM has the following default path in

the Oracle grid infrastructure home in a Linux environment:

$ORACLE_HOME/dbs/spfile+ASM.ora

o PFILE in the Oracle ASM instance home

2.9.2 Backing Up, Moving a ASM spfile

You can back up, copy, or move an Oracle ASM SPFILE with the ASMCMD spbackup, spcopy

or spmove commands. For information about these ASMCMD commands see the Oracle

Database Storage Administrator's Guide.

See Also: Oracle Database Storage Administrator's Guide "Configuring Initialization

Parameters for an Oracle ASM Instance" for more information.




66

3 Resources

Oracle Clusterware manages applications and processes as resources that you register with

Oracle Clusterware. The number of resources you register with Oracle Clusterware to

manage an application depends on the application. Applications that consist of only one

process are usually represented by only one resource. More complex applications, built on

multiple processes or components, may require multiple resources.

3.1 Resource types

Generally, all resources are unique but some resources may have common attributes. Oracle

Clusterware uses resource types to organize these similar resources. Using resource types

provides the following benefits:

– Manage only necessary resource attributes

– Manage all resources based on the resource type

Every resource that is registered in Oracle Clusterware must have a certain resource type. In

addition to the resource types included in Oracle Clusterware, custom resource types can be

defined using the crsctl utility. The included resource types are:

– Base resource: base type

– Local resource: instances of local resources (type name is local_resource) run on

each server of the cluster, e.g. ora.node14.vip.

– Cluster resource: cluster-aware resource types (type name is cluster_resource) are

aware of the cluster environment and are subject to cardinality and cross-server

switchover and failover; example: ora.asm.

All user-defined resource types must be based, directly or indirectly, on either the

local_resource or cluster_resource type.

In order to list all defined types and their base types, run the crsctl stat type command:

TYPE_NAME=application


TYPE_NAME=cluster_resource

BASE_TYPE=resource

TYPE_NAME=local_resource

BASE_TYPE=resource

TYPE_NAME=ora.asm.type

BASE_TYPE=ora.local_resource.type

TYPE_NAME=ora.cluster_resource.type





67

TYPE_NAME=ora.cluster_vip.type

BASE_TYPE=ora.cluster_resource.type

TYPE_NAME=ora.cluster_vip_net1.type

BASE_TYPE=ora.cluster_vip.type

TYPE_NAME=ora.database.type


TYPE_NAME=ora.diskgroup.type


TYPE_NAME=ora.eons.type


TYPE_NAME=ora.gns.type


TYPE_NAME=ora.gns_vip.type


TYPE_NAME=ora.gsd.type


TYPE_NAME=ora.listener.type


TYPE_NAME=ora.local_resource.type

BASE_TYPE=local_resource

TYPE_NAME=ora.network.type


TYPE_NAME=ora.oc4j.type


TYPE_NAME=ora.ons.type


TYPE_NAME=ora.registry.acfs.type


TYPE_NAME=ora.scan_listener.type


TYPE_NAME=ora.scan_vip.type





68

TYPE_NAME=resource

BASE_TYPE=

To list all the attributes and default values for a type, run crsctl stat type <typeName> -f (forfull configuration) or –p (for static configuration).

3.1.1 Base Resource Type Definition

This section specifies the attributes that make up the resource type definition. A resource is

an abstract and read-only type definition. The type may only serve as a base for other types.

Oracle Clusterware 11.2.0.1 will not allow user-defined types to extend this type directly.

To see all default values and names from the base resource type, run crsctl stat type

resource –p.

Name History Description

NAME From

10gR2

The name of the resource. Resource names must be unique

and may not be modified once the resource is created.

TYPE From

10gR2,

modified

Semantics are unchanged; values other than application exist

Type: string

Special Values: No

CHECK_INTERVAL From

10gR2

Unchanged

Type: unsigned integer

Special Values: No

Per-X Support: Yes

DESCRIPTION From

10gR2

Unchanged

Type: string

Special Values: No

RESTART_ATTEMPTS From

10gR2

Unchanged


Special Values: No

Per-X Support: Yes

START_TIMEOUT From

10gR2

Unchanged


Special Values: No

Per-X Support: Yes

STOP_TIMEOUT From

10gR2

Unchanged


Special Values: No




69

Per-X Support: Yes

SCRIPT_TIMEOUT From

10gR2

Unchanged

Type: unsigned integerSpecial Values: No

Per-X Support: Yes

UPTIME_THRESHOLD From

10gR2

Unchanged

Type: string

Special Values: No

Per-X Support: Yes

AUTO_START From

10gR2

Unchanged

Type: string

Format: restore|never|always

Required: NoDefault: restore

Special Values: No

BASE_TYPE New The name of the base type from which this type extends. This

is the value of the “TYPE” in the base type’s profile.

Type: string

Format: [name of the base type]

Required: Yes

Default: empty string (none)

Special Values: No

Per-X Support: No

DEGREE New This is the count of the number of instances of the resource

that are allowed to run on a single server. Today’s

application has a fixed degree of one. Degree supports

multiplicity within a server


Format: [number of attempts, >=1]

Required: No

Default: 1

Special Values: No

ENABLED New The flag that governs the state of the resource as far as being

managed by Oracle Clusterware, which will not attempt to

manage a disabled resource whether directly or because of a

dependency to another resource. However, stopping of the

resource when requested by the administrator will be allowed




70

(so as to make it possible to disable a resource without having

to stop it). Additionally, any change to the resource’s state

performed by an ‘outside force’ will still be proxied into the

clusterware.


Format: 1 | 0

Required: No

Default: 1

Special Values: No

Per-X Support: Yes

START_DEPENDENCIES New Specifies a set of relationships that govern the start of the

resource.

Type: stringRequired: No

Default:

Special Values: No

STOP_DEPENDENCIES New Specifies a set of relationships that govern the stop of the

resource.

Type: string

Required: No

Default:

Special Values: No

AGENT_FILENAME New An absolute filename (that is, inclusive of the path and file

name) of the agent program that handles this type. Every

resource type must have an agent program that handles its

resources. Types can do so by either specifying the value for

this attribute or inheriting it from their base type.

Type: string

Required: Yes

Special Values: Yes

Per-X Support: Yes (per-server only)

ACTION_SCRIPT From

10gR2,

modified

An absolute filename (that is, inclusive of the path and file

name) of the action script file. This attribute is used in

conjunction with the AGENT_FILENAME. CRSD will invoke the

script in the manner it did in 10g for all entry points

(operations) not implemented in the agent binary. That is, if

the agent program implements a particular entry point, it is




71

invoked; if it does not, the script specified in this attribute will

be executed.

Please note that for backwards compatibility with previousreleases, a built-in agent for the application type will be

included with CRS. This agent is implemented to always

invoke the script specified with this attribute.

Type: string

Required: No

Default:

Special Values: Yes

Per-X Support: Yes (per-server only)

ACL New Contains permission attributes. The value is populated at

resource creation time based on the identity of the processcreating the resource, unless explicitly overridden. The value

can subsequently be changed using the APIs/command line

utilities, provided that such a change is allowed based on the

existing permissions of the resource.

Format:owner:<user>:rwx,pgrp:<group>:rwx,other::r—

Where

owner: the OS User of the resource owner, followed by the

permissions that the owner has. Resource actions will be

executed as with this user ID.

pgrp: the OS Group that is the resource’s primary group,

followed by the permissions that members of the group have

other: followed by permissions that others have

Type: string

Required: No

Special Values: No

STATE_CHANGE_EVENT_TEM

PLATE

New The template for the State Change events. Type: string

Required: No

Default:Special Values: No

PROFILE_CHANGE_EVENT_TE

MPLATE

New The template for the Profile Change events. Type: string

Required: No

Default:




72

Special Values: No

ACTION_FAILURE_EVENT_TE

MPLATE

New The template for the State Change events.

Type: string

Required: No

Default:

Special Values: No

LAST_SERVER New An internally managed, read-only attribute that contains the

name of the server on which the last start action has

succeeded.

Type: string

Required: No, read-only

Default: emptySpecial Values: No

OFFLINE_CHECK_INTERVAL New Used for controlling off-line monitoring of a resource. The

value represents the interval (in seconds) to use for implicitly

monitoring the resource when it is OFFLINE. The monitoring is

turned off if the value is 0


Required: No

Default: 0

Special Values: No

Per-X Support: Yes




73

STATE_DETAILS New An internally managed, read-only attribute that contains

details about the state of the resource. The attribute fulfills

the following needs:

1. CRSD understood resource states (Online, Offline,

Intermediate, etc) may map to different resource-specific

values (mounted, unmounted, open, closed, etc). In order to

provide a better description of this mapping, resource agent

developers may choose to provide a ‘state label’ as part of

providing the value of the STATE.

2. Providing the label, unlike the value of the resource state,

is optional. If not provided, the Policy Engine will use CRSD-

understood state values (Online, Offline, etc). Additionally, in

the event the agent is unable to provide the label (as may also

happen to the value of STATE), the Policy Engine will set the

value of this attribute to do it is best at providing the details

as to why the resource is in the state it is (why it is

Intermediate and/or why it is Unknown)

Type: string

Required: No, read-only

Default: empty

Special Values: No

3.1.2 Local Resource Type Definition

The local_resource type is the basic building block for resources that are instantiated for

each server but are cluster oblivious and have a locally visible state. While the definition of

the type is global to the clusterware, the exact property values of the resource instantiation

on a particular server are stored on that server. This resource type has no equivalent in

Oracle Clusterware 10gR2 and is a totally new concept to Oracle Clusterware.

The following table specifies the attributes that make up the local_resource type definition.

To see all default values run the command crsctl stat type local_resource –p.

Name Description

ALIAS_NAME Type: stringRequired: No

Special Values: Yes

Per-X Support: No

LAST_SERVER Overridden from resource: the name of the server to which the resource




74

is assigned (“pinned”).

Only Cluster Administrators will be allowed to register local resources.

3.1.3 Cluster Resource Type Definition

The cluster_resource is the basic building block for resources that are cluster aware and

have globally visible state. 11.1‘s application is a cluster_resource. The type’s base is

resource. The type definition is read-only. The following table specifies the attributes that

make up the cluster_resource type definition.

The following table specifies the attributes that make up the cluster_resource type

definition. Run crsctl stat type cluster_resource –p to see all default values.

Name History Description

ACTIVE_PLACEMENT From 10gR2 Unchanged


Special Values: No

FAILOVER_DELAY From 10gR2 Unchanged, Deprecated

Special Values: No

FAILURE_INTERVAL From 10gR2 Unchanged


Special Values: No

Per-X Support: Yes

FAILURE_THRESHOLD From 10gR2 Unchanged


Special Values: No

Per-X Support: Yes

PLACEMENT From 10gR2 Format: value

where value is one of the following:

restricted

Only servers that belong to the associated server

pool(s) or hosting members may host instances of the

resource.

favored

If only SERVER_POOLS or HOSTING_MEMBERS

attribute is non-empty, servers belonging to the




75

specified server pool(s)/hosting member list will be

considered first if available; if/when none are available,

any other server will be used.

If both SERVER_POOLS and HOSTING_MEMBERS are

populated, the former indicates preference while the

latter – restricts the choices to the servers within that

preference

balanced

Any ONLINE, enabled server may be used for

placement. Less loaded servers will be preferred to

more loaded ones. To measure how loaded a server is,

clusterware will use the LOAD attribute of resources

that are ONLINE on the server. The sum total of LOADvalues is used as the absolute measure of the current

server load.

Type: string

Default: balanced

Special Values: No

HOSTING_MEMBERS From 10g The meaning from this attribute is taken from the

previous release.

Although not officially deprecated, the use of this

attribute is discouraged.

Special Values: No

Required: @see SERVER_POOLS

SERVER_POOLS New Format:

* | [<pool name1> […]]

This attribute creates an affinity between the resource

and one or more server pools as far as placement goes.

The meaning of this attribute depends on what the

value of PLACEMENT is.

When a resource should be able to run on any server of

the cluster, a special value of * needs to be used. Note

that only Cluster Administrators can specify * as the

value for this attribute.






77

Required: No

Default: 1

Special Values: No

Per-X Support: Yes

3.2 Resource Dependencies

With Oracle Clusterware 11.2 a new dependency concept is introduced, to be able to build

dependencies for start and stop actions independent and have a much better granularity.

3.2.1 Hard Dependency

If resource A has a hard dependency on resource B, B must be ONLINE before A will be

started. Please note there is no requirement that A and B be located on the same server.

A possible parameter to this dependency would allow resource B to be in either in ONLINE

or INTERMEDIATE state. Such a variation is sometimes referred to as the intermediate

dependency.

Another possible parameter to this dependency would make it possible to differentiate if A

requires that B be present on the same server or on any server in the cluster. In other words,

this illustrates that the presence of resource B on the same server as A is a must for resource

A to start.

If the dependency is on a resource type, as opposed to a concrete resource, this should be

interpreted as “any resource of the type”. The aforementioned modifiers for locality/state

still apply accordingly.

3.2.2 Weak Dependency

If resource A has a weak dependency on resource B, an attempt to start of A will attempt to

start B if is not ONLINE. The result of the attempt to start B is, however, of no consequence

to the result of starting A (it is ignored). Additionally, if start of A causes an attempt to start

B, failure to start A has no affect on B.

A possible parameter to this dependency is whether or not the start of A should wait for

start of B to complete or may execute concurrently.

Another possible parameter to this dependency would make it possible to differentiate if A

desires that B be running on the same server or on any server in the cluster. In other words,

this illustrates that the presence of resource B on the same server as A is a desired for

resource A to start. In addition to the desire to have the dependent resource started locally

or on any server in the cluster, another possible parameter is to start the dependent

resource on every server where it can run.




78


interpreted as “every resource of the type”. The aforementioned modifiers for locality/state


3.2.3 Attraction

If resource A attracts B, then whenever B needs to be started, servers that currently have A

running will be first on the list of placement candidates. Since a resource may have more

than one resource to which it is attracted, the number of attraction-exhibiting resources will

govern the order of precedence as far as server placement goes.


interpreted as “any resource of the type”.

A possible flavor of this relation is to require that a resource’s placement be re-evaluated

when a related resource’s state changes. For example, resource A is attracted to B and C. At

the time of starting A, A is started where B is. Resource C may either be running or started

thereafter. Resource B is subsequently shut down/fails and does not restart. Then resource

A requires that at this moment its placement be re-evaluated and it be moved to C. This is

somewhat similar to the AUTOSTART attribute of the resource profile, with the dependent

resource’s state change acting as a trigger as opposed to a server joining the cluster.

A possible parameter to this relation is whether or not resources in intermediate state

should be counted as running thus exhibit attraction or not.

If resource A excludes resource B, this means that starting resource A on a server where B is

running will be impossible. However, please see the dependency’s namesake for STOP to

find out how B may be stopped/relocated so A may start.

3.2.4 Pull-up

If a resource A needs to be auto-started whenever resource B is started, this dependency is

used. Note that the dependency will only affect A if it is not already running. As is the case

for other dependency types, pull-up may cause the dependent resource to start on any or

the same server, which is parameterized. Another possible parameter to this dependency

would allow resource B to go to either in ONLINE or INTERMEDIATE state to trigger pull-up

of A. Such a variation is sometimes referred to as the intermediate dependency. Note that if

resource A has pull-up relation to resources B and C, then it will only be pulled up when both

B and C are started. In other words, the meaning of resources mentioned in the pull-up

specification is interpreted as a Boolean AND.

Another variation in this dependency is if the value of the TARGET of resource A plays a role:

in some cases, a resource needs to be pulled-up irrespective of its TARGET while in others

only if the value of TARGET is ONLINE. To accommodate both needs, the relation offers a

modifier to let users specify if the value of the TARGET is irrelevant; by default, pull-up will




79

only start resources if their TARGET is ONLINE. Note that this modifier is on the relation, not

on any of the targets as it applies to the entire relation.

If the dependency is on a resource type, as opposed to a concrete resource, this should beinterpreted as “any resource of the type”. The aforementioned modifiers for locality/state


3.2.5 Dispersion

The property between two resources that desire to avoid being co-located, if there’s no

alternative other than one of them being stopped, is described by the use of the dispersion

relation. In other words, if resource A prefers to run on a different server than the one

occupied by resource B, then resource A is said to have a dispersion relation to resource B at

start time. This sort of relation between resources has an advisory effect, much like that of

attraction: it is not binding as the two resources may still end up on the same server.

A special variation on this relation is whether or not crsd is allowed/expected to disperse

resources, once it is possible, that are already running. In other words, normally, crsd will

not disperse co-located resources when, for example, a new server becomes online: it will

not actively relocate resources once they are running, only disperse them when starting

them. However, if the dispersion is ‘active’, then crsd will try to relocate one of the

resources that disperse to the newly available server.

A possible parameter to this relation is whether or not resources in intermediate state

should be counted as running thus exhibit attraction or not.




80

4 Fast Application Notification (FAN)

4.1 Event Sources

In 11.2, the CRSD master is the originator of most events, and the database is the source of the Remote Load Balance (RLB) events. The CRSD master passes events from the

PolicyEngine thread to the ReporterModule thread, in which the events are translated to

eONS events, and then the events are sent out to peers within the cluster. If eONS is not

running, the ReporterModule attempts to cache the events until it the eONS server is

running, and then retries. The events are guaranteed to be sent and received in the order in

which the actions happened.

4.2 Event Processing architecture in oraagent

4.2.1 database / ONS / eONS agents

Every node runs one database agent, one ONS agent, and one eONS agent within crsd's

oraagent process. These agents are responsible for stop/start/check actions. There are no

dedicated threads for each agent; instead oraagent use a pool of threads to execute these

actions for the various resources.

4.2.2 eONS subscriber threads

Each of the three agents (as mentioned above) is associated with one other thread in the

oraagent that is blocked on ons_subscriber_receive(). These eONS subscriber threads can be

identified by the string "Thread:[EonsSub ONS]", "Thread:[EonsSub EONS]" and

"Thread:[EonsSub FAN]" in the oraagent log. In the example below, a service was stopped

and this node's crsd oraagent process and its three eONS subscriber received the event:

2009-05-26 23:36:40.479: [AGENTUSR][2868419488][UNKNOWN] Thread:[EonsSub FAN]

process {

2009-05-26 23:36:40.500: [AGENTUSR][2868419488][UNKNOWN] Thread:[EonsSub FAN]

process }

2009-05-26 23:36:40.540: [AGENTUSR][2934963104][UNKNOWN] Thread:[EonsSub ONS]

process }

2009-05-26 23:36:40.558: [AGENTUSR][2934963104][UNKNOWN] Thread:[EonsSub ONS]

process {

2009-05-26 23:36:40.563: [AGENTUSR][2924329888][UNKNOWN] Thread:[EonsSub EONS]

process {

2009-05-26 23:36:40.564: [AGENTUSR][2924329888][UNKNOWN] Thread:[EonsSub EONS]

process }

4.2.3 Event Publishers/processors in general

On one node of the cluster, the eONS subscriber of the following agents also assumes the

role of a publisher or processor or master (pick your favorite terminology):

– One dbagent's eONS subscriber assumes the role "CLSN.FAN.pommi.FANPROC"; this

subscriber is responsible for publishing ONS events (FAN events) to the HA alerts

queue for database 'pommi'. There is one FAN publisher per database in the cluster.




81

– One onsagent's eONS subscriber assumes the role "CLSN.ONS.ONSPROC", publisher

for ONS events; this subscriber is responsible for sending eONS events to ONS

clients.

– Each eonsagent's eONS subscriber on every node publishes eONS events as user

callouts. There is no single eONS publisher in the cluster. User callouts are no longer

produced by racgevtf.

The publishers/processors can be identified by searching for "got lock":

staiu01/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26

19:51:41.549: [AGENTUSR][2934959008][UNKNOWN] CssLock::tryLock, got lock

CLSN.ONS.ONSPROC


19:51:41.626: [AGENTUSR][3992972192][UNKNOWN] CssLock::tryLock, got lockCLSN.ONS.ONSNETPROC



CLSN.RLB.pommi



CLSN.FAN.pommi.FANPROC

These CSS-based locks work in such a way that any node can grab the lock if it is not already

held. If the process of the lock holder goes away, or CSS thinks the node went away, the lock

is released and someone else tries to get the lock. The different processors try to grab the

lock whenever they see an event. If a processor previously was holding the lock, it doesn't

have to acquire it again. There is currently no implementation of a "backup" or designated

failover-publisher.

4.2.4 ONSNETPROC

In a cluster of 2 or more nodes, one onsagent's eONS subscriber will also assume the role of

CLSN.ONS.ONSNETPROC, i.e. is responsible for just publishing network down events. The

publishers with the roles of CLSN.ONS.ONSPROC and CLSN.ONS.ONSNETPROC cannot and

will not run on the same node, i.e. they must run on distinct nodes.

If both the CLSN.ONS.ONSPROC and CLSN.ONS.ONSNETPROC simultaneously get their public

network interface pulled down, there may not be any event.

4.2.5 RLB publisher

Another additional thread tied to the dbagent thread in the oraagent process of only one

node in the cluster, is " Thread:[RLB:dbname]", and it dequeues the LBA/RLB/affinity event




82

from the SYS$SERVICE_METRICS queue, and publishes the event to eONS clients. It assumes

the lock role of CLSN.RLB.dbname. The CLSN.RLB.dbname publisher can run on any node,

and is not related to the location of the MMON master (who enqueues LBA events into the

SYS$SERVICE_METRICS queue. So since the RLB publisher (RLB.dbname) can run on a

different node than the ONS publisher (ONSPROC), RLB events can be dequeued on one

node, and published to ONS on another node. There is one RLB publisher per database in

the cluster

Sample trace, where Node 3 is the RLB publisher, and Node 2 has the ONSPROC role:

– Node 3:

2009-05-28 19:29:10.754: [AGENTUSR][2857368480][UNKNOWN]

Thread:[RLB:pommi] publishing message srvname = rlb

2009-05-28 19:29:10.754: [AGENTUSR][2857368480][UNKNOWN]

Thread:[RLB:pommi] publishing message payload = VERSION=1.0 database=pommi

service=rlb { {instance=pommi_3 percent=25 flag=UNKNOWN

aff=FALSE}{instance=pommi_4 percent=25 flag=UNKNOWN

aff=FALSE}{instance=pommi_2 percent=25 flag=UNKNOWN

aff=FALSE}{instance=pommi_1 percent=25 flag=UNKNOWN aff=FALSE} }

timestamp=2009-05-28 19:29:10

The RLB events will be received by the eONS subscriber of the ONS publisher

(ONSPROC) who then posts the event to ONS:

–

Node 2:2009-05-28 19:29:40.773: [AGENTUSR][3992976288][UNKNOWN] Publishing the

ONS event type database/event/servicemetrics/rlb




83

4.2.6 Example

– Node 1

o assumes role of FAN/AQ publisher CLSN.FAN.dbname.FANPROC, enqueues

HA events into HA alerts queue

o assumes role of eONS publisher to generate user callouts

MMON enqueues RLB events into SYS$SERVICE_METRICS queue

– Node 2

o assumes role of ONS publisher CLSN.ONS.ONSPROC to publish ONS and RLB

events to ONS subscribers (listener, JDBC ICC/UCP)


– Node 3

o assumes role of ONSNET publisher CLSN.ONS.ONSNETPROC to publish ONS

events to ONS subscribers (listener, JDBC ICC/UCP)


– Node 4

o assumes role of RLB publisher CLSN.RLB.dbname, dequeues RLB events

from SYS$SERVICE_METRICS queue and posts them to eONS


4.2.7 Coming up in 11.2.0.2

The above description is only valid for 11.2.0.1. In 11.2.0.2, the eONS proxy a.k.a eONS

server will be removed, and its functionality will be assumed by evmd. In addition, the

tracing as described above, will change significantly. The major reason for this change was

the high resource usage of the eONS JVM.

In order to find the publishers in the oraagent.log in 11.2.0.2, search for these patterns:

“ONS.ONSNETPROC CssLockMM::tryMaster I am the master”

“ONS.ONSPROC CssLockMM::tryMaster I am the master”

“FAN.<dbname> CssLockMM::tryMaster I am the master”

“RLB.<dbname> CssSemMM::tryMaster I am the master”




84

5 Configuration best practices

5.1 Cluster interconnect

Oracle does not recommend configuring separate interfaces for Oracle Clusterware andOracle RAC; instead, if multiple private interfaces are configured in the system, we

recommend those to be bonded to a single interface in order to provide redundancy in case

of a NIC failure. Unless bonded, multiple private interfaces provide only load balancing, not

failover capabilities.

The consequences of changing interface names depend on which name you are changing,

and whether you are also changing the IP address. In cases where you are only changing the

interface names, the consequences are minor. If you change the name for the public

interface that is stored in the OCR, then you also must modify the node applications for each

node. Therefore, you must stop the node applications for this change to take effect.

Changes made with oifcfg delif / setif for the cluster interconnect also change the privateinterconnect used by clusterware, hence an Oracle Clusterware restart is the consequence.

The interface used by the Oracle RAC (RDBMS) interconnect must be the same interface that

Oracle Clusterware is using with the hostname. Do not configure the private interconnect

for Oracle RAC on a separate interface that is not monitored by Oracle Clusterware.

See Also: Oracle Clusterware Administration and Deployment Guide, "Changing Network

Addresses on Manually Configured Networks" for more information.

5.2 misscount

As misscount is a critical value, Oracle does not support changing the default value. The

current misscount value can be checked with

# crsctl get css misscount

CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.

In case of vendor clusterware integration we set misscount to 600 in order to give the

vendor clusterware enough time to make a node join / leave decision. Never change the

default in a vendor clusterware configuration.




85

6 Clusterware Diagnostics and Debugging

6.1 Check Cluster Health

After a successful cluster installation or node startup the health of the entire cluster or a

node can be checked.

‘crsctl check has’ will check if OHASD is started on the local node and if the daemon is

running healthy.

# crsctl check has

CRS-4638: Oracle High Availability Services is online

‘crsctl check crs’ will check the OHASD, the CRSD, the ocssd and the EVM daemon.

# crsctl check crs

CRS-4638: Oracle High Availability Services is online

CRS-4537: Cluster Ready Services is online

CRS-4529: Cluster Synchronization Services is online

CRS-4533: Event Manager is online

‘crsctl check cluster –all’ will check all the daemons from all nodes belonging to that cluster.

# crsctl check cluster –all

**************************************************************

node1:




**************************************************************

node2:




**************************************************************

During startup issues monitor the output from the crsctl start cluster command; all attempts

to start a resource should be successful. If the start of a resource is failing, consult the

appropriate log file to see the errors.

# crsctl start cluster

CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'

CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded

CRS-2672: Attempting to start 'ora.cssd' on 'node1'

CRS-2672: Attempting to start 'ora.diskmon' on 'node1'




86

CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded

CRS-2676: Start of 'ora.cssd' on 'node1' succeeded

CRS-2672: Attempting to start 'ora.ctssd' on 'node1'

CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded

CRS-2672: Attempting to start 'ora.evmd' on 'node1'

CRS-2672: Attempting to start 'ora.asm' on 'node1'

CRS-2676: Start of 'ora.evmd' on 'node1' succeeded

CRS-2676: Start of 'ora.asm' on 'node1' succeeded

CRS-2672: Attempting to start 'ora.crsd' on 'node1'

CRS-2676: Start of 'ora.crsd' on 'node1' succeeded

6.2 crsctl command line tool

It is the Oracle Clusterware management utility that has commands to manage allClusterware entities under the Oracle Clusterware framework. This includes the daemons

that are part of the Clusterware, wallet management and clusterized commands that work

on all or some of the nodes in the cluster.

You can use CRSCTL commands to perform several operations on Oracle Clusterware, such

as:

– Starting and stopping Oracle Clusterware resources

– Enabling and disabling Oracle Clusterware daemons

– Checking the health of the cluster

– Managing resources that represent third-party applications

– Integrating Intelligent Platform Management Interface (IPMI) with Oracle Clusterware to

provide failure isolation support and to ensure cluster integrity

– Debugging Oracle Clusterware components

Mostly all of the operations are cluster-wide.

See Also: Oracle Clusterware Administration and Deployment Guide, "CRSCTL Utility

Reference" for more information about using crsctl.

You can use crsctl set log commands as the root user to enable dynamic debugging for

Cluster Ready Services (CRS), Cluster Synchronization Services (CSS), and the Event Manager

(EVM), and the clusterware subcomponents. You can dynamically change debugging levels

using crsctl debug commands. Debugging information remains in the Oracle Cluster Registry

for use during the next startup. You can also enable debugging for resources.




87

A full comprehensive list of all debugging features and options is listed in the

“Troubleshooting and Diagnostic Output” section in the “Oracle Clusterware Administration

and Deployment Guide”.

6.3 Trace File Infrastructure and Location

Oracle Clusterware uses a unified log directory structure to consolidate component log files.

This consolidated structure simplifies diagnostic information collection and assists during

data retrieval and problem analysis.

Oracle Clusterware uses a file rotation approach for log files. If you cannot find the reference

given in the file specified in the "Details in" section of an alert file message, then this file

might have been rolled over to a rollover version, typically ending in *.lnumber where

number is a number that starts at 01 and increments to however many logs are being kept,

the total for which can be different for different logs. While there is usually no need to

follow the reference unless you are asked to do so by Oracle Support, you can check thepath given for roll over versions of the file. The log retention policy, however, foresees that

older logs are be purged as required by the amount of logs generated

GRID_HOME /log/<host>/diskmon – Disk Monitor Daemon

GRID_HOME /log/<host>/client – OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL – edit the

GRID_HOME /srvm/admin/ocrlog.ini file to increase the trace level

GRID_HOME /log/<host>/admin – not used

GRID_HOME/ log/<host>/ctssd – Cluster Time Synchronization Service

GRID_HOME /log/<host>/gipcd – Grid Interprocess Communication Daemon

GRID_HOME /log/<host>/ohasd – Oracle High Availability Services Daemon

GRID_HOME /log/<host>/crsd – Cluster Ready Services Daemon

GRID_HOME /log/<host>/gpnpd – Grid Plug and Play Daemon

GRID_HOME /log/<host>/mdnsd – Mulitcast Domain Name Service Daemon

GRID_HOME /log/<host>/evmd – Event Manager Daemon

GRID_HOME /log/<host>/racg/racgmain – RAC RACG

GRID_HOME /log/<host>/racg/racgeut – RAC RACG

GRID_HOME /log/<host>/racg/racgevtf – RAC RACG

GRID_HOME /log/<host>/racg – RAC RACG (only used if pre-11.1 database is installed)

GRID_HOME /log/<host>/cssd – Cluster Synchronization Service Daemon

GRID_HOME /log/<host>/srvm – Server Manager

GRID_HOME /log/<host>/agent/ohasd/oraagent_oracle11 – HA Service Daemon Agent

GRID_HOME /log/<host>/agent/ohasd/oracssdagent_root – HA Service Daemon CSS Agent

GRID_HOME /log/<host>/agent/ohasd/oracssdmonitor_root – HA Service Daemon

ocssdMonitor Agent

GRID_HOME /log/<host>/agent/ohasd/orarootagent_root – HA Service Daemon Oracle Root

Agent

GRID_HOME /log/<host>/agent/crsd/oraagent_oracle11 – CRS Daemon Oracle Agent




88

GRID_HOME /log/<host>/agent/crsd/orarootagent_root – CRS Daemon Oracle Root Agent

GRID_HOME /log/<host>/agent/crsd/ora_oc4j_type_oracle11 – CRS Daemon OC4J Agent

(11.2.0.2 feature and not used in 11.2.0.1)

GRID_HOME /log/<host>/gnsd – Grid Naming Services Daemon

6.3.1 Diagcollection

The best way to get all clusterware related traces for an incident is using

Grid_home/bin/diagcollection.pl. To get all traces and an OCRDUMP run the command as

root user “diagcollection.pl –collect –crshome <GRID_HOME >” on all nodes from the cluster

and provide support or development the collected traces.

# Grid_home/bin/diagcollection.pl

Production Copyright 2004, 2008, Oracle. All rights reserved

Cluster Ready Services (CRS) diagnostic collection tool

diagcollection

--collect

[--crs] For collecting crs diag information

[--adr] For collecting diag information for ADR

[--ipd] For collecting IPD-OS data

[--all] Default.For collecting all diag information.

[--core] UNIX only. Package core files with CRS data

[--afterdate] UNIX only. Collects archives from the specified

date. Specify in mm/dd/yyyy format

[--aftertime] Supported with -adr option. Collects archives

after the specified time. Specify in YYYYMMDDHHMISS24 format

[--beforetime] Supported with -adr option. Collects archives

before the specified date. Specify in YYYYMMDDHHMISS24 format

[--crshome] Argument that specifies the CRS Home location

[--incidenttime] Collects IPD data from the specified time.

Specify in MM/DD/YYYY24HH:MM:SS format

If not specified, IPD data generated in the past 2

hours are collected

[--incidentduration] Collects IPD data for the duration after

the specified time. Specify in HH:MM format.

If not specified, all IPD data after incidenttime are

collected

NOTE:

1. You can also do the following

./diagcollection.pl --collect --crs --crshome <CRS Home>

--clean cleans up the diagnosability




89

information gathered by this script

--coreanalyze UNIX only. Extracts information from core files

and stores it in a text file

For more information about collection of IPD data please see section 6.4.

In case of a vendor clusterware installation it is important to collect and provide all related

vendor clusterware files to Oracle Support.

6.3.2 Alert Messages Using Diagnostic Record Unique IDs

Beginning with Oracle Database 11g release 2 (11.2), certain Oracle Clusterware messages

contain a text identifier surrounded by "(:" and ":)". Usually, the identifier is part of the

message text that begins with "Details in..." and includes an Oracle Clusterware diagnostic

log file path and name similar to the following example. The identifier is called a DRUID, or

Diagnostic Record Unique ID:

2009-07-16 00:18:44.472

[/scratch/11.2/grid/bin/orarootagent.bin(13098)]CRS-5822:Agent

'/scratch/11.2/grid/bin/orarootagent_root' disconnected from server. Details at

(:CRSAGF00117:) in

/scratch/11.2/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log

.

DRUIDs are used to relate external product messages to entries in a diagnostic log file and to

internal Oracle Clusterware program code locations. They are not directly meaningful to

customers and are used primarily by Oracle Support when diagnosing problems.

6.4 OUI / SRVM / JAVA related GUI tracing

There are several Java-based GUI tools which in case of errors should run with the following

trace levels set:

"setenv SRVM_TRACE true" (or "export SRVM_TRACE=true")

"setenv SRVM_TRACE_LEVEL 2" (or "export SRVM_TRACE_LEVEL=2")

The Oracle Universal Installer can run with the –debug flag in case of installer errors (e.g.

"./runInstaller -debug" for install).

6.5 Reboot Advisory

Oracle Clusterware may, in certain circumstances, instigate rebooting of a node to ensurethe overall health of the cluster and of the databases and other applications running on it.

The decision to reboot a node can be made by Clusterware running on that node or by

Clusterware on another node in the cluster. When the decision is made on the problematic

node, ordinary activity logging (such as the Clusterware alert log) is not reliable: time is of




90

the essence in most reboot scenarios, and the reboot usually occurs before the operating

system flushes buffered log data to disk. This means that an explanation of what led to the

reboot may be lost.

New in the 11.2 release of Oracle Clusterware is a feature called Reboot Advisory that

improves the chances of preserving an explanation for a Clusterware-initiated reboot. At

the moment a reboot decision is made by Clusterware, a short explanatory message is

produced and an attempt is made to “publish” it in two ways:

The reboot decision is written to a small file (normally on locally-attached storage) using a

“direct”, non-buffered I/O request. The file is created and preformatted in advance of the

failure (during Clusterware startup), so this I/O has a high probability of success, even on a

failing system. The reboot decision is also broadcast over all available network interfaces on

the failing system.

These operations are executed in parallel and are subject to an elapsed time limit so as notto delay the impending reboot. Attempting both disk and network publication of the

message makes it likely that at least one succeeds, and often both will. Successfully stored

or transmitted Reboot Advisory messages ultimately appear in a Clusterware alert log on

one or more nodes of the cluster.

When network broadcast of a Reboot Advisory is successful, the associated messages

appear in the alert logs of other nodes in the cluster. This happens more or less

instantaneously, so the messages can be viewed immediately to determine the cause of the

reboot. The message includes the host name of node that is being rebooted to distinguish it

from the normal flow of alert messages for that node. Only nodes in the same cluster as the

failing node will display these messages.

If the Reboot Advisory was successfully written to a disk file, when Oracle Clusterware starts

the next time on that node, it will produce messages related to the prior in the Clusterware

alert log. Reboot Advisories are timestamped and the startup scan for these files will

announce any occurrences that are less than 3 days old. The scan doesn’t empty or mark

already-announced files, so the same Reboot Advisory can appear in the alert log multiple

times if Clusterware is restarted on a node multiple times within a 3-day period.

Whether from a file or a network broadcast, Reboot Advisories use the same alert log

messages, normally two per advisory. The first is message CRS-8011, which displays the host

name of the rebooting node, a software component identifier, and a timestamp

(approximately the time of the reboot). An example looks like this:

[ohasd(24687)]CRS-8011:reboot advisory message from host: sta00129, component:

CSSMON, with timestamp: L-2009-05-05-10:03:25.340

Following message CRS-8011 will be CRS-8013, which conveys the explanatory message for

the forced reboot, as in this example:




91

[ohasd(24687)]CRS-8013:reboot advisory message text: Rebooting after limit 28500

exceeded; disk timeout 27630, network timeout 28500, last heartbeat from ocssd at

epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock

value of 93235653

Note that everything in message CRS-8013 after “text:” originates in the Clusterware

component that instigated the reboot. Because of the critical circumstances in which it is

produced, this text does not come from an Oracle NLS message file: it is always in English

language and USASCII7 character set.

In some circumstances, Reboot Advisories may convey binary diagnostic data in addition to a

text message. If so, message CRS-8014 and one or more of message CRS-8015 will also

appear. This binary data is used only if the reboot situation is reported to Oracle for

resolution.

Because multiple components can write to the Clusterware alert log at the same time, it ispossible that the messages associated with a given Reboot Advisory may appear with other

(unrelated) messages interspersed. However, messages for different Reboot Advisories are

never interleaved: all of the messages for one Advisory are written before any message for

another Advisory.

For additional information, refer to the Oracle Errors manual discussion of messages CRS-

8011 and –8013.

7 Other Tools

7.1 ocrpatchocrpatch was developed in 2005 in order provide Development and Support with a tool that

is able to fix corruptions or make other changes in OCR in the case where official tools such

as ocrconfig or crsctl were unable to handle such changes. ocrpatch is NOT being distributed

as part of the software release. The functionality of ocrpatch is already well described in a

separate document, therefore we won't go into details in this paper; the ocrpatch document

is located in the public RAC Performance Group Folder on stcontent.

7.2 vdpatch

7.2.1 Introduction

vdpatch is a new, Oracle internal tool, developed for Oracle Clusterware 11g release 2

(11.2). vdpatch pretty much uses the same code as ocrpatch, i.e. the look & feel is very

similar. The purpose of this tool is to facilitate diagnosis of CSS related issues where voting

file content is involved. vdpatch operates on a per-block basis, i.e. it can read (not write)

512-byte blocks from a voting file by block number or name. Similarly to ocrpatch, it

attempts to interpret the content in a meaningful way instead of just presenting columns of




92

hexadecimal values. vdpatch allows online (clusterware stack and ocssd running) and offline

(clusterware stack / ocssd not running) access. vdpatch works for both voting files on NAS

and in ASM. At this time, vdpatch is not actively being distributed such as ocrpatch.

Development and Support have to obtain a binary from a production ADE label.

7.2.2 General Usage

vdpatch can only be run as root, otherwise receive

$ vdpatch

VD Patch Tool Version 11.2 (20090724)

Oracle Clusterware Release 11.2.0.2.0

Copyright (c) 2008, 2009, Oracle. All rights reserved.

[FATAL] not privileged

[OK] Exiting due to fatal error ...

The filename/pathname of the voting file(s) can be obtained via 'crsctl query css votedisk'

command; note that this command only works if ocssd is running. If ocssd is not up, crsctlwill signal

# crsctl query css votedisk

Unable to communicate with the Cluster Synchronization Services daemon.

If ocssd is running, you will receive the following output:



-- ----- ----------------- --------- ---------

1. ONLINE 0909c24b14da4f89bfbaf025cd228109 (/dev/raw/raw100) [VDDG]

2. ONLINE 9c74b39a1cfd4f84bf27559638812106 (/dev/raw/raw104) [VDDG]

3. ONLINE 1bb06db216434fadbfa3336b720da252 (/dev/raw/raw108) [VDDG]


The above output indicates that there are three voting files defined in the diskgroup +VDDG,

each located on particular raw device, which is part of the ASM diskgroup. vdpatch allows

opening only ONE device at a time to read its content:

# vdpatch

VD Patch Tool Version 11.2 (20090724)

Oracle Clusterware Release 11.2.0.2.0


vdpatch> op /dev/raw/raw100

[OK] Opened /dev/raw/raw100, type: ASM

If the voting file is on a raw device, crsctl and vdpatch would show



-- ----- ----------------- --------- ---------

1. ONLINE 1de94f4db65a4f9bbf8b9bf3eba6f43b (/dev/raw/raw126) []

2. ONLINE 26d28a7311264f77bf8df6463420e614 (/dev/raw/raw130) []




93

3. ONLINE 9f862a63239b4f52bfdbce6d262dc349 (/dev/raw/raw134) []


# vdpatch

VD Patch Tool Version 11.2 (20090724)Oracle Clusterware Release 11.2.0.2.0



[OK] Opened /dev/raw/raw126, type: Raw/FS

In order to open another voting file, simply run 'op' again:




[INFO] closing voting file /dev/raw/raw126


Using the 'h' command, it will list all other available commands:

vdpatch> h

Usage: vdpatch

BLOCK operations

op <path to voting file> open voting file

rb <block#> read block by block#

rb status|kill|lease <index> read named block

index=[0..n] => Devenv nodes 1..(n-1)

index=[1..n] => shiphome nodes 1..n

rb toc|info|op|ccin|pcin|limbo read named block

du dump native block from offset

di display interpreted block

of <offset> set offset in block, range 0-511

MISC operations

i show parameters, version, info

h this help screen

exit / quit exit vdpatch

7.2.3 Common Use Case

The common use case for vdpatch is reading the content. Voting file blocks can be read by

either block number or named block type. For types TOC, INFO, OP, CCIN, PCIN and LIMBO,

there just exists one block in the voting file, so reading this one block would be done by e.g.

running 'rb toc'; the output will both show a hex/ascii dump of the 512-byte block, as well as

the interpreted content of that block:

vdpatch> rb toc

[OK] Read block 4

[INFO] clssnmvtoc block

0 73734C63 6B636F54 01040000 00020000 00000000 ssLckcoT............

20 00000000 40A00000 00020000 00000000 10000000 ....@...............

40 05000000 10000000 00020000 10020000 00020000 ....................




94

…

…

420 00000000 00000000 00000000 00000000 00000000 ....................

440 00000000 00000000 00000000 00000000 00000000 ....................

460 00000000 00000000 00000000 00000000 00000000 ....................

480 00000000 00000000 00000000 00000000 00000000 ....................

500 00000000 00000000 00000000 ............

[OK] Displayed block 4 at offset 0, length 512

[INFO] clssnmvtoc block

magic1_clssnmvtoc: 0x634c7373 - 1665954675

magic2_clssnmvtoc: 0x546f636b - 1416586091

fmtvmaj_clssnmvtoc: 0x01 - 1

fmtvmin_clssnmvtoc: 0x04 - 4

resrvd_clssnmvtoc: 0x0000 - 0

maxnodes_clssnmvtoc: 0x00000200 - 512

incarn1_clssnmvtoc: 0x00000000 - 0

incarn2_clssnmvtoc: 0x00000000 - 0

filesz_clssnmvtoc: 0x0000a040 - 41024

blocksz_clssnmvtoc: 0x00000200 - 512

hdroff_clssnmvtoc: 0x00000000 - 0

hdrsz_clssnmvtoc: 0x00000010 - 16

opoff_clssnmvtoc: 0x00000005 - 5

statusoff_clssnmvtoc: 0x00000010 - 16

statussz_clssnmvtoc: 0x00000200 - 512

killoff_clssnmvtoc: 0x00000210 - 528

killsz_clssnmvtoc: 0x00000200 - 512

leaseoff_clssnmvtoc: 0x0410 - 1040

leasesz_clssnmvtoc: 0x0200 - 512

ccinoff_clssnmvtoc: 0x0006 - 6

pcinoff_clssnmvtoc: 0x0008 - 8

limbooff_clssnmvtoc: 0x000a - 10

volinfooff_clssnmvtoc: 0x0003 - 3

For block types STATUS, KILL and LEASE, there exists one block per defined cluster node, sothe 'rb' command needs to be used in combination with an index that denotes the node

number. In a Development environment, the index starts with 0, while in a

shiphome/production environment, the index starts with 1. So in order to read the 5th

node's KILL block in a Development environment, submit 'rb kill 4', while in a production

environment, use 'rb kill 5'.

Example to read the STATUS block of node 3 (here: staiu03) in a Development environment:

vdpatch> rb status 2

[OK] Read block 18

[INFO] clssnmdsknodei vote block

0 65746F56 02000000 01040B02 00000000 73746169 etoV............stai20 75303300 00000000 00000000 00000000 00000000 u03.................

40 00000000 00000000 00000000 00000000 00000000 ....................

60 00000000 00000000 00000000 00000000 00000000 ....................

80 00000000 3EC40609 8A340200 03000000 03030303 ....>....4..........

100 00000000 00000000 00000000 00000000 00000000 ....................

120 00000000 00000000 00000000 00000000 00000000 ....................




95

140 00000000 00000000 00000000 00000000 00000000 ....................

160 00000000 00000000 00000000 00000000 00000000 ....................

180 00000000 00000000 00000000 00000000 00000000 ....................

200 00000000 00000000 00000000 00000000 00000000 ....................

220 00000000 00000000 00000000 00000000 00000000 ....................

240 00000000 00000000 00000000 00000000 00000000 ....................

260 00000000 00000000 00000000 00000000 00000000 ....................

280 00000000 00000000 00000000 00000000 00000000 ....................

300 00000000 00000000 00000000 00000000 00000000 ....................

320 00000000 00000000 00000000 00000000 00000000 ....................

340 00000000 00000000 00000000 8E53DF4A ACE84A91 .............S.J..J.

360 E4350200 00000000 03000000 441DDD4A 6051DF4A .5..........D..J`Q.J

380 00000000 00000000 00000000 00000000 00000000 ....................

400 00000000 00000000 00000000 00000000 00000000 ....................

420 00000000 00000000 00000000 00000000 00000000 ....................

440 00000000 00000000 00000000 00000000 00000000 ....................

460 00000000 00000000 00000000 00000000 00000000 ....................

480 00000000 00000000 00000000 00000000 00000000 ....................

500 00000000 00000000 00000000 ............

[OK] Displayed block 18 at offset 0, length 512

[INFO] clssnmdsknodei vote block

magic_clssnmdsknodei: 0x566f7465 - 1450144869

nodeNum_clssnmdsknodei: 0x00000002 - 2

fmtvmaj_clssnmdsknodei: 0x01 - 1

fmtvmin_clssnmdsknodei: 0x04 - 4

prodvmaj_clssnmdsknodei: 0x0b - 11

prodvmin_clssnmdsknodei: 0x02 - 2

killtime_clssnmdsknodei: 0x00000000 - 0

nodeName_clssnmdsknodei: staiu03

inSync_clssnmdsknodei: 0x00000000 - 0

reconfigGen_clssnmdsknodei: 0x0906c43e - 151438398

dskWrtCnt_clssnmdsknodei: 0x0002348a - 144522

nodeStatus_clssnmdsknodei: 0x00000003 - 3

nodeState_clssnmdsknodei[CLSSGC_MAX_NODES]:node 0: 0x03 - 3 - MEMBER

node 1: 0x03 - 3 - MEMBER



timing_clssnmdsknodei.sts_clssnmTimingStmp: 0x4adf538e - 1256149902 - Wed Oct 21

11:31:42 2009

timing_clssnmdsknodei.stms_clssnmTimingStmp: 0x914ae8ac - 2437605548

timing_clssnmdsknodei.stc_clssnmTimingStmp: 0x000235e4 - 144868

timing_clssnmdsknodei.stsi_clssnmTimingStmp: 0x00000000 - 0

timing_clssnmdsknodei.flags_clssnmdsknodei: 0x00000003 - 3

unique_clssnmdsknodei.eptime_clssnmunique: 0x4add1d44 - 1256004932 - Mon Oct 19

19:15:32 2009

ccinid_clssnmdsknodei.cin_clssnmcinid: 0x4adf5160 - 1256149344 - Wed Oct 21

11:22:24 2009

ccinid_clssnmdsknodei.unique_clssnmcinid: 0x00000000 - 0

pcinid_clssnmdsknodei.cin_clssnmcinid: 0x00000000 - 0 - Wed Dec 31 16:00:00 1969

pcinid_clssnmdsknodei.unique_clssnmcinid: 0x00000000 - 0




96

We do not plan to allow vdpatch making any changes to a voting file. The only

recommended way of modifying voting files is to drop and recreate them using the crsctl

command.

7.3 Appvipcfg – adding an application VIP

In 11.2 the creation and deletion from an application or uservip can be managed via

Grid_home/bin/appvipcfg

Production Copyright 2007, 2008, Oracle.All rights reserved

Usage: appvipcfg create -network=<network_number> -ip=<ip_address>

-vipname=<vipname>

-user=<user_name>[-group=<group_name>]

delete -vipname=<vipname>

The appvipcfg command line tool can only create an application VIP on the default network

for which the resource ora.net1.network is created by default. If someone needs to createan application VIP on a different network or subnet this must be done manually.

EXAMPLE of creating a uservip on a different network (ora.net2.network)

srvctl add vip -n node1 -k 2 -A appsvip1/255.255.252.0/eth2

crsctl add type coldfailover.vip.type -basetype ora.cluster_vip_net2.type

crsctl add resource coldfailover.vip -type coldfailover.vip.type -attr \

"DESCRIPTION=USRVIP_resource,RESTART_ATTEMPTS=0,START_TIMEOUT=0, STOP_TIMEOUT=0, \

CHECK_INTERVAL=10, USR_ORA_VIP=10.137.11.163, \

START_DEPENDENCIES=hard(ora.net2.network)pullup(ora.net2.network), \

STOP_DEPENDENCIES=hard(ora.net2.network), \

ACL='owner:root:rwx,pgrp:root:r-x,other::r--,user:oracle11:r-x'"

There are a couple known bugs around that area and for tracking purpose and completeness

we will list them:

– 8623900 srvctl remove vip -i <ora.vipname> is removing the associated

ora.netx.network

– 8620119 appvipcfg should be expanded to create a network resource

– 8632344 srvctl modify nodeapps -a will modify the vip even if the interface is not

valid

– 8703112 appsvip should have the same behavior as ora.vip like vip failback

– 8758455 uservip start failed and orarootagent core dump in clsn_agent::agentassert

– 8761666 appsvipcfg should respect /etc/hosts entry for apps ip even if gns is

configured




97

– 8820801 using a second network (k 2) I’m able to add and start the same ip twice

7.4 Application and Script Agent

The application or script agent manages the application/resource through the application

specific user code. Oracle Clusterware contains a special shared library (libagfw) which

allows users to plug-in application specific actions using a well defined interface.

The following sections describe how to build an agent using Oracle Clusterware's agent

framework interface.

7.4.1 Action Entry Points

Action entry points refer to user defined code that needs to be executed whenever an

action has to be taken on a resource (start resource, stop resource etc.). For every resource

type, Clusterware requires that action entry points are defined for the following actions:

start : Actions to be taken to start the resource

stop : Actions to gracefully stop the resource

check : Actions taken to check the status of the resource

clean : Actions to forcefully stop the resource.

These action entry points can be defined using C++ code or in script. If any of these actions

are not explicitly defined, Clusterware assumes by default that they are defined in a script.

This script is located via the ACTION_SCRIPT attribute for the resource type. Hence it is

possible to have hybrid agents, which define some action entry points using script and other

action entry points using C++. It is possible to define action entry points for other actions too

(e.g. for changes in attribute value) but these are not mandatory.

7.4.2 Sample Agents

Consider a file as the resource that needs to be managed by Clusterware. An agent that

manages this resource has the following tasks:

On startup : Create the file.

On shutdown : Gracefully delete the file.

On check command: Detect whether the file is present or not.

On clean command: Forcefully delete the file.

To describe this particular resource to Oracle Clusterware, a specialized resource type is first

created, that contains all the characteristic attributes for this resource class. In this case, the

only special attribute to be described is the filename to be monitored. This can be done withthe crsctl command. While defining the resource type, we can also specify the

ACTION_SCRIPT and AGENT_FILENAME attributes. These are used to refer to the shell script

and executables that contain the action entry points for the agents.




98

Once the resource type is defined, there are several options to write a specialized agent

which does the required tasks - the agent could be written as a script, as a C/C++ program or

as a hybrid.

Examples for each of them are given below.

7.4.3 Shell script agent

The file Grid_home/crs/demo/demoActionScript is a shell script which already contains all

the required action entry points and can act as an agent for the file resource. To test this

script, the following steps need to be performed:

(1) Start the Clusterware installation.

(2) Add a new resource type using the crsctl utility as below

$ crsctl add type test_type1 -basetype cluster_resource -attr \

"ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" -attr \

"ATTRIBUTE=ACTION_SCRIPT,TYPE=string,DEFAULT_VALUE=/path/to/demoActionScript"

Modify the path to the file appropriately. This adds a new resource type to Clusterware.

Alternately, the attributes can be added in a text file which is passed as a parameter to the

CRSCTL utility.

(3) Add new resources to the cluster using the crsctl utility. The commands to do this are:

$ crsctl add resource r1 -type test_type1 -attr "PATH_NAME=/tmp/r1.txt"

$ crsctl add resource r2 -type test_type1 -attr "PATH_NAME=/tmp/r2.txt"

Modify the PATH_NAME attribute for the resources as needed. This adds resources named

r1 and r2 to be monitored by clusterware. Here we are overriding the default value for the

PATH_NAME attribute for our resources.

(4) Start/stop the resources using the crsctl utility. The commands to do this are:

$ crsctl start res r1


$ crsctl check res r1

$ crsctl stop res r2

The files /tmp/r1.txt and /tmp/r2.txt get created and deleted as the resources r1 and r2 get

started and stopped.

7.4.4 Option 2: C++ agentOracle provides a demoagent1.cpp in the Grid_home/crs/demo directory. The

demoagent1.cpp is a sample C++ program that has similar functionality to the shell script

above. This program also monitors a specified file on the local machine. To test this

program, the following steps need to be performed:




99

(1) Compile the C++ agent using the provided source file demoagent1.cpp and makefile.

The makefile needs to be modified based on the local compiler/linker paths and install

locations. The output will be an executable named demoagent1

(2) Start the Clusterware


$ crsctl add type test_type1 -basetype cluster_resource \

-attr "ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" \

-attr "ATTRIBUTE=AGENT_FILENAME,TYPE=string,DEFAULT_VALUE=/path/to/demoagent1"

Modify the path to the file appropriately. This adds a new resource type to Clusterware.

(4) Create a new resource based on the type that is defined above. The commands are as

follows:

$ crsctl add res r3 -type test_type1 -attr "PATH_NAME=/tmp/r3.txt"


This adds resources named r3 and r4 to be monitored by Clusterware.

(5) Start/stop the resource using the CRSCTL utility. The commands to do so are:

$ crsctl start res re3




The files /tmp/r3.txt and /tmp/r4.txt get created and deleted as the resources get started

and stopped.

7.4.5 Option 3: Hybrid agent

The Grid_home/crs/demo/demoagent2.cpp is a sample C++ program that has similar

functionality to the shell script above. This program also monitors a specified file on the local

machine. However, this program defines only the CHECK action entry point - all other action

entry points are left undefined and are read from the ACTION_SCRIPT attribute. To test this

program, the following steps need to be performed:

(1) Compile the C++ agent using the provided source file demoagent2.cpp and makefile.

The makefile needs to be modified based on the local compiler/linker paths and install

locations. The output will be an executable named demoagent2.

(2) Start the Clusterware


$ crsctl add type test_type1 -basetype cluster_resource \

-attr "ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" \




100

-attr "ATTRIBUTE=AGENT_FILENAME,TYPE=string,DEFAULT_VALUE=/path/demoagent2" \

-attr "ATTRIBUTE=ACTION_SCRIPT,TYPE=string,DEFAULT_VALUE=/path/demoActionScript"

Modify the path to the files appropriately. This adds a new resource type to Clusterware.

(4) Create new resources based on the type that is defined above. The commands are as

follows:



This adds resources named r5 and r6 to be monitored by Clusterware.

(5) Start/stop the resource using the CRSCTL utility. The commands to do so are:





The files /tmp/r5.txt and /tmp/r6.txt get created and deleted as the resources get started

and stopped.

7.5 Oracle Cluster Health Monitor - OS Tool (IPD/OS)

7.5.1 Overview

This tool (formerly known as Instantaneous Problem Detection tool) is designed to detect

and analyze operating system (OS) and cluster resource related degradation and failures in

order to bring more explanatory power to many Oracle Clusterware and Oracle RAC issues

such as node eviction.

It continuously tracks the OS resource consumption at node, process, and device level. It

collects and analyzes the cluster-wide data. In real time mode, when thresholds are hit, an

alert is shown to the operator. For root cause analysis, historical data can be replayed to

understand what was happening at the time of failure.

The tool installation is pretty simple and is described in the README shipped with the zipfile.

The latest version is uploaded on OTN for Linux and NT under the following link

http://www.oracle.com/technology/products/database/clustering/ipd_download_homepag

e.html




101

7.5.2 Install the Oracle Cluster Health Monitor

In order to install the tool on a list of nodes run the following basic steps (for more detailed

information read the REAMDE):

– Unzip the package

– Create user crfuser:oinstall on all nodes

– Make sure crfuser’s home is same on all nodes

– Set up password-less ssh for crfuser on all nodes

– Login as crfuser and run crfinst.pl with appropriate options

– To finalize install, login as root and run crfinst.pl –f on all installed nodes

– CRF_home is set to /usr/lib/oracrf on Linux

7.5.3 Running the OS Tool stack

The OS tool must be started via /etc/init.d/init.crfd start. This command spawns the

osysmond process which spawns the ologgerd daemon. The ologgerd then picks a replica

node (if >= 2 nodes) and informs the osysmond on that node to spawn the replica ologgerd.

The OS Tool stack can be shutdown on a node as follows:

# /etc/init.d/init.crfd disable

7.5.4 Overview of Monitoring Process (osysmond)

The osysmond (one daemon per cluster node) will perform the following steps to collect the

data:

– Monitors and gathers system metrics periodically

– Runs as real time process

– Runs validation rules against the system metrics

– Marks color coded alerts based on thresholds

– Sends the data to the master Logger daemon

– Logs data to local disk in case of failure to send

The osysmond will alert on perceived node-hangs (under-utilized resources despite many

potential consumer tasks)

– CPU usage < 5%

– CPU Iowait > 50%

– MemFree < 25%




102

– # Disk IOs persec < 10% of max possible Disk IOs persec

– # bytes of outbound n/w traffic limited to data sent by SYSMOND

– # tasks node-wide > 1024

7.5.5 CRFGUI

The Oracle Cluster Health Monitor is shipped with two data retrieval tools one is the crfgui

which is the main GIU display.

Crfgui connects to the local or remote master LOGGERD. Is the GUI installed inside the

cluster it auto detects the LOGGERD, otherwise, running outside the cluster a cluster node

must be specified with the ‘-m’ switch.

The GUI alerts critical resource usage events and perceived system hangs. After starting it

we support different GUI views like cluster view, node view and device view.

Usage: crfgui [-m <node>] [-d <time>] [-r <sec>] [-h <sec>]

[-W <sec>] [-i] [-f <name>] [-D <int>]

-m <node> Name of the master node (tmp)

-d <time> Delayed at a past time point

-r <sec> Refresh rate

-h <sec> Highlight rate

-W <sec> Maximal poll time for connection

-I interactive with cmd prompt

-f <name> read from file, ".trc" added if no suffix

given

-D <int> sets an internal debug level

7.5.6 oclumon

A command line tool is included in the package which can be used to query the Berkeley DB

backend to print out to the terminal the node specific metrics for a specified time period.

The tool also supports a query to print the durations and the states for a resource on a node

during a specified time period. These states are based on predefined thresholds for each

resource metric and are denoted as red, orange, yellow and green indicating decreasing

order of criticality. For example, you could ask to show how many seconds did the CPU on

node "node1" remain in RED state during the last 1 hour. Oclumon can also be used to

perform miscellaneous administrative tasks such as changing the debug levels, querying

version of the tool, changing the metrics database size, etc.The usage of the oclumon can be printed by oclumon –h. To get more information about

each verb option run oclumon <verb> -h.

Currently supported verbs are:

showtrail, showobjects, dumpnodeview, manage, version, debug, quit and help




103

Below some useful attribute examples that can be passed to oclumon. The default location

for oclumon is /usr/lib/oracrf/bin/oclumon.

Showobjects

oclumon showobjects –n node –time "2009-10-07 15:11:00"

Dumpnodeview

oclumon dumpnodeview –n node

Showgaps

oclumon showgaps -n node1 -s "2009-10-07 02:40:00" \

-e "2009-10-07 03:59:00"

Number of gaps found = 0

Showtrail

oclumon showtrail -n node1 -diskid sde qlen totalwaittime \

-s "2009-07-09 03:40:00" -e "2009-07-09 03:50:00" \

-c "red" "yellow" "green"

Parameter=QUEUE LENGTH

2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN

2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN

2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN

2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN

Parameter=TOTAL WAIT TIME

oclumon showtrail -n node1 -sys cpuqlen -s \

"2009-07-09 03:40:00" -e "2009-07-09 03:50:00" \

-c "red" "yellow" "green"

Parameter=CPU QUEUELENGTH

2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN

2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN

2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN

2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN

7.5.7 What to collect for cluster related issues

With Oracle Clusterware 11g release 2 the Grid_home/bin/diagcollection.pl is collecting

Oracle Cluster Health Monitor data as well if found it installed on a cluster, which is

recommended by Oracle.

To collect the data after a hang or node eviction, to analyze the issue, perform the followingsteps.

– Run the 'Grid_home/bin/diagcollection.pl --collect --ipd --incidenttime <inc time> --

incidentduration <duration>' command on the IPD master, LOGGERD node, where --

incidenttime format is MM/DD/YYYY24HH:MM:SS, and --incidentduration is HH:MM




104

– Identify the LOGGERD node using the

/usr/lib/oracrf/bin/oclumon manage -getkey "MASTER=" command. Starting with

11.2.0.2 the oclumon will be in the Grid_home/bin directory.

– Collect data for at least 30 min before and after the incident.

masterloggerhost:$./bin/diagcollection.pl --collect --ipd --incidenttime

10/05/200909:10:11 --incidentduration 02:00

Starting with 11.2.0.2 and the CRS integrated IPD/OS the syntax to get the IPD data

collected is "masterloggerhost:$./bin/diagcollection.pl --collect --crshome

/scratch/grid_home_11.2/ --ipdhome /scratch/grid_home_11.2/ --ipd --

incidenttime 01/14/201001:00:00 --incidentduration 04:00"

– The IPD data file which will look like:

ipdData_<hostname>_<curr time>.tar.gz

ipdData_node1_20091006_2321.tar.gz

– How long does it take to run diagcollect?

4 node cluster, 4 hour data - 10 min

32 node cluster, 1 hour data - 20 min

7.5.8 Debugging

In order to turn on debugging for the osysmond or the loggerd run ‘oclumon debug log all

allcomp:5’ as root user. This will turn on debugging for all components.

Starting with 11.2.0.2 the IPD/CHM log files will be under

Grid_home/log/<hostname>/crfmond

Grid_home/log/<hostname>/crfproxy

Grid_home/log/<hostname>/crflogd

7.5.9 For ADE users

Installation and start of IPD/OS in a development environment is simpler:

$ cd crfutl && make setup && runcrf

osysmond usually starts immediately, while it may take seconds (minutes if your I/O

subsystem is slow) for ologgerd and oproxyd to start due to the initialization of the Berkeley

Database (bdb). First node to call 'runcrf' will be configured as master. First node after the

master to run 'runcrf' will be configured as replica. From there on, things will move if

required. Daemons to look out for are: osysmond (on all nodes), ologgerd (on master andreplica nodes), oproxyd (on all nodes).

In a development environment, the IPD/OS processes do not run as root or in real time.




105

7.5.10 11.2.0.2

- The oproxyd process may or may not exist anymore. As of time of publication of this

document, the oproxyd process is disabled.

- IPDOS will be represented by the OHASD resource ora.crf, and the need for manual

installation and configuration for both development and production environments

will be eliminated.




106

8 Appendix

References

Oracle Clusterware 11g Release 2 (11.2) – Using standard NFS to support a third

voting file for extended cluster configurations

Grid Infrastructure Installation Guide 11g Release 2 (11.2)

Clusterware Administration and Deployment Guide 11g Release 2 (11.2)

Storage Administrator's Guide 11g Release 2 (11.2)

Oracle Clusterware 11g Release 2 Technical Overview

http://www.oracle.com/technology/products/database/clustering/ipd_download_h

omepage.html

Functional Specification for CRS Resource Modeling Capabilities, Oracle

Clusterware, 11gR2

Useful Notes

Note 294430.1 - CSS Timeout Computation in Oracle Clusterware

Note 1050693.1 - Troubleshooting 11.2 Clusterware Node Evictions (Reboots)

Note 1053010.1 - How to Dump the Contents of an Spfile on ASM when ASM/GRID

is down

Note 338706.1 - Oracle Clusterware (formerly CRS) Rolling Upgrades

Note:785351.1 - Upgrade Companion 11g Release 2

http://www.oracle.com/technology/products/database/oracle11g/upgrade/index.h

tml



Documents

11gR2 Clusterware Technical Wp