Outline
Overview of High Availability (HA)
Highly Available NNM
Highly Available ITO
Management of HA clusters
Integrated products
Performance products
Demand for HA solutions
Shift in focus on NSM solutions in the marketplace
– NSM core IT department function
– IT departments accountable to business units
– Driven by Service Level Agreements (SLAs)
– Impact of reporting
Highly Available NNM
Collection station failover
– Implemented in Distributed Internet Discovery & Monitoring (DIM)
– Allows a Management Station (MS) to pick up status polling responsibility for a failed Collection Station (CS)
Highly Available cluster
– Allows NNM to run on a cluster of 2 or more servers that provide continuous availability
Distributed Architecture Overview
[Diagram: NNM Station A (Management Station) connected to NNM Stations B through n (Collection Stations)]
Discovery & Status Polling occurs locally
Changes in status and topology are relayed from the collection stations (stations B through n) to management station A.
Failover Configuration
Failover filter
– Allows specified objects to be polled by MS during CS failure
Enabling CS failover
– xnmtopoconf -failover {stationlist}
Applying failover filter
– xnmtopoconf -failoverFilter {filter} {stationlist}
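A minimal sketch of these commands for one collection station; the station name (cs1.example.com) and filter name (CriticalNodes) are illustrative, and the final verification step assumes xnmtopoconf supports -print:

    # Enable failover for an illustrative collection station
    xnmtopoconf -failover cs1.example.com
    # Apply an illustrative failover filter so the MS polls only critical objects
    xnmtopoconf -failoverFilter CriticalNodes cs1.example.com
    # Verify the collection station settings (option assumed available)
    xnmtopoconf -print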
Limitations of DIM for HA
Scaling
– Limited by number of devices polled and the polling interval
– Practical size of object database (primary & secondary objects)
– Capacity of network connection between MS & CS
– Event data may be lost since SNMP traps are not forwarded when CS fails
– Multiple CS failures may create problems for MS
Highly Available ITO
Multiple Managers
– Manager of Managers
– Peer Managers
– Follow-the-Sun
Hardware Backup
– Cold Standby
– Highly Available Cluster
Multiple Managers
Manager of Managers (MoM)
– Single ITO server manages multiple ITO servers
Peer Managers
– Multiple managers with a designated responsibility providing redundancy to one another
Follow-the-Sun
– Management responsibility moves based on time of day
Limitations of Multiple Managers
Scalability
– Number of managed nodes per ITO server
– Number of messages received in the ITO browser
– Number of operator logons
– Speed of network connections to remote sites
– NNM configuration issues
– Agents must be told to report to new ITO server (time issue)
Hardware Redundancy
Cold Backup
– Cost effective
• 1 server backs up multiple ITO servers
• Shared storage device not required
– Downtime may be unacceptable
• Configuration of failed server loaded after failover
– Message issues
• No synchronization of current message data
• Latency detecting events while message buffers are cleared
– Need to implement configuration synchronization
Hardware Redundancy
Hot Standby
– ITO servers implemented in HA cluster
– Rapid, automated failover of a failed ITO server
– Configuration and message data shared between nodes
– Upgrades & patches require more effort to install
– Costly solution
• Requires shared disk array
• MC/ServiceGuard software required
– Can provide LAN failover
Basic HA Definitions
Package - application and associated processes; can only run on 1 node in the cluster
Service - a process monitored by MC/SG
Original Node - node where package existed before failover
Adoptive Node - node that takes control of a package
Overview of a Cluster
[Diagram: Node 1 (10.1.50.25) and Node 2 (10.1.50.26) joined by LAN 1 and LAN 2 through a bridge, plus a dedicated heartbeat LAN; Package A runs on the originating node under virtual IP 10.1.50.27 with access to shared data and fails over to the adoptive node]
Steps to Implement HA
Create a volume group on shared device
Create logical volumes in that group
– /etc/opt/OV/share
– /var/opt/OV/share
– /opt/OV/OpC_SG
– /u01/oradata/OpenView (DB files)
– /u01/app/oracle/product (DB binaries)
Create fully qualified hostname & IP address for ITO package
Activate & mount volume group on primary node
– vgchange -a y /dev/{volume_group}
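A minimal sketch of the activation and mount step, assuming an illustrative volume group /dev/vg_ito whose logical volumes lvol1 through lvol5 map to the directories listed above:

    # Activate the shared volume group on the primary node
    vgchange -a y /dev/vg_ito
    # Mount the five shared logical volumes (lvol names are assumptions)
    mount /dev/vg_ito/lvol1 /etc/opt/OV/share
    mount /dev/vg_ito/lvol2 /var/opt/OV/share
    mount /dev/vg_ito/lvol3 /opt/OV/OpC_SG
    mount /dev/vg_ito/lvol4 /u01/oradata/OpenView
    mount /dev/vg_ito/lvol5 /u01/app/oracle/product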
Steps to Implement HA
Install Oracle binaries on primary node
Install ITO binaries on primary node
Install latest ITO/NNM patches on primary node
Configure the ITO database (opcconfig)
– Select MC/SG installation
– Use fully qualified name for package
– Shared lvol is /opt/OV/OpC_SG
– Configure DB automatically
– Do not enable startup at boot time
Configure startup of ITO processes manually in $OV_LRF directory:
– ovaddobj ovoacomm.lrf
– ovaddobj opc.lrf
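A short sketch of that registration step; the $OV_LRF location noted in the comment is an assumption about a typical install:

    # Register the ITO server startup files with ovspmd
    cd $OV_LRF                  # commonly /etc/opt/OV/share/lrf (assumption)
    ovaddobj ovoacomm.lrf       # ITO communication processes
    ovaddobj opc.lrf            # ITO server processes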
Steps to Implement HA
Modify /etc/oratab to enable autostart
Verify ITO/NNM starts
Modify ov.conf file
– Clean copy - /opt/OV/newconfig/OVNNM-RUN/conf/ov.conf
– Create ov.conf.host1 & ov.conf.host2
• Modify HOSTNAME= field in each to match the local hostname
• Modify NNM_INTERFACE= to match floating IP
• Modify USE_LOOPBACK= to ON
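An illustrative ov.conf.host1 fragment; the hostname is a placeholder, and the floating IP reuses the virtual address from the cluster diagram:

    # ov.conf.host1 (values are illustrative)
    HOSTNAME=host1.example.com      # local hostname of node 1
    NNM_INTERFACE=10.1.50.27        # floating (virtual) package IP
    USE_LOOPBACK=ON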
Modify NNM auth files
– ovw.auth
– ovwdb.auth
– ovspmd.auth
Steps to Implement HA
Verify ITO/NNM starts
– First copy ov.conf.host1 to ov.conf
Install bits on 2nd node
– Do not run opcconfig
– Verify operation of NNM
Modify opcsvinfo file on both to include lines:
– OPC_SG TRUE
– OPC_SG_NNM TRUE
Switch shared volume to 2nd node (see the sketch below)
– Shutdown OpenView & Oracle
– Unmount all 5 volumes
– Deactivate the shared volume group
– Activate VG & mount lvols on 2nd node
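A minimal sketch of that switch, reusing the illustrative /dev/vg_ito volume group and mount points from the earlier step:

    # On the primary node: stop OpenView, shut down Oracle, release shared storage
    ovstop
    for fs in /etc/opt/OV/share /var/opt/OV/share /opt/OV/OpC_SG \
              /u01/oradata/OpenView /u01/app/oracle/product
    do
        umount $fs
    done
    vgchange -a n /dev/vg_ito           # deactivate the shared volume group
    # On the 2nd node: activate the volume group and remount the lvols
    vgchange -a y /dev/vg_ito
    mount /dev/vg_ito/lvol1 /etc/opt/OV/share    # ...repeat for the remaining lvols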
Steps to Implement HA
Copy the following to the 2nd node:
– /etc/oratab
– /etc/opt/OV/share/conf/ovdbconf
Create new server registration file:
– cd /etc/opt/OV/share/conf/OpC/mgmt_sv
– mv svreg svreg.OLD
– touch svreg
– /opt/OV/bin/OpC/install/opcsvreg -add itosvr.reg
– rm svreg.OLD
Remove ovserver file:
– rm /var/opt/OV/share/databases/openview/ovwdb/ovserver
Steps to Implement HA
Run opcconfig
– Do not configure the DB automatically
Create ITO package
Modify monitor scripts to watch necessary OpenView processes
Configure package control scripts
Verify operation of ITO on both cluster nodes independently
Test failover by killing monitored process
Configuration Files
ito.ctl
– Package control file, contains steps necessary to activate/deactivate ITO in failover
ito.mon
– Describes processes that the package monitors (see the sketch after this list)
ito_create_new_svreg
– Creates new server reg file on failover, invoked by ito.ctl
ito_start_sgtrapi & ito_stop_sgtrapi
– Starts & stops trap interceptor on ITO server
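For illustration, a minimal sketch of the kind of loop a monitor script such as ito.mon runs; the process names and interval are assumptions, not the shipped script:

    # Exit non-zero (failing the ServiceGuard service and triggering failover)
    # if a monitored ITO process disappears; the process list is an assumption
    while true
    do
        for proc in ovspmd opcctlm opcmsgm
        do
            if ! ps -ef | grep "$proc" | grep -v grep > /dev/null
            then
                exit 1
            fi
        done
        sleep 30
    done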
Common Problems
Failure to modify opcsvinfo file
– OPC_SG_NNM_STARTUP TRUE
– OPC_SG TRUE
Failure to remove ovserver file
– If in doubt, remove & let it get re-created
Failure to modify NNM auth files
Failure to modify the ov.conf file
– HOSTNAME, NNM_INTERFACE fields
Keep in Mind
Patches must be applied to each node
Key commands (see the example below)
– cmhaltpkg ITO
– cmrunpkg -n {node} -v ITO
– cmviewcl
– cmmodpkg -e ITO
– cmrunnode {node}
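A sketch of a manual switch of the ITO package to the other node using these commands; the node name is illustrative:

    cmhaltpkg ITO                   # halt the package on its current node
    cmrunpkg -n node2 -v ITO        # start it on the adoptive node
    cmmodpkg -e ITO                 # re-enable automatic switching for the package
    cmviewcl -v                     # confirm cluster and package state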
An ovstop will cause switchover
Processes to monitor for failover
Managing HA Clusters
Managing the Management Cluster
– Trap Template
– Log files
– Process monitors
Managing HA packages
– Process monitors
– Log file issues
Managing HA Clusters
Shared log files
– Monitored from active node
– Copy contents to log.node on failover, start with a clean slate (see the sketch below)
– Use "close after read" to avoid corruption
– Use "read from last file position"
– Do not use "message on no logfile"
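A minimal sketch of that log-preservation step, assuming an illustrative shared log path; this would typically run from the package control script on failover:

    # Preserve the shared log under a per-node name, then start with a clean slate
    LOG=/var/opt/OV/share/log/package.log      # path is an assumption
    cp "$LOG" "$LOG.$(hostname)"
    > "$LOG"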
Monitors
– Intelligent monitors that read output of cmviewcl to determine state
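For example, a sketch of such a check, assuming the package is named ITO and that cmviewcl -p accepts a package name:

    # Treat this node as active only if the ITO package is running locally
    if cmviewcl -v -p ITO | grep "running" | grep "$(hostname)" > /dev/null
    then
        echo "ITO package active on this node"
    fi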
Managing HA Clusters
Trap interceptor
– Assign to virtual node
– Started up on active node with ito_start_sgtrapi
– Stopped on inactive node with ito_stop_sgtrapi
Notification & Trouble Ticketing in HA clusters
Connect to TT using virtual hostname
– Databases will be synched
– Intelligence needed to move SPI monitoring
– Include startup/shutdown of SPI in package control scripts
HW notification requires connection to each server
SW-based: use virtual hostname to connect
PerfView & MeasureWare in HA Environments
Management Server
– PerfView binaries loaded on every cluster node
– Use shared data repository as MW data collection
– Each node connects to MW agent as necessary
Managed Node
– MW agent runs on each cluster node to collect performance data
– Application data collected on cluster node that currently runs the package