ORACLE CLUSTERWARE AND PRIVATE NETWORK CONSIDERATIONS- PRACTICAL PERFORMANCE MANAGEMENT FOR ORACLE RAC 6/23/22 1 Guenadi Nedkov Jilevski


April 13, 2023 1

ORACLE CLUSTERWARE AND PRIVATE NETWORK CONSIDERATIONS- PRACTICAL PERFORMANCE MANAGEMENT FOR ORACLE RAC

Guenadi Nedkov Jilevski


AGENDA

• Oracle RAC Fundamentals and Infrastructure
• Analysis of Cache Fusion Impact on RAC
• Private Interconnect Considerations
• Aggregation
• Common Known Problems and Symptoms: from Cache Fusion wait events and statistics
• Diagnostics and Problem Troubleshooting
• Q and A


ORACLE RAC FUNDAMENTALS AND INFRASTRUCTURE

Oracle RAC Architecture


ORACLE RAC FUNDAMENTALS AND INFRASTRUCTURE

Function and Processes of Global Enqueue Services (GES) and Global Cache Services (GCS)


ORACLE RAC FUNDAMENTALS AND INFRASTRUCTURE

Global Buffer Cache


ANALYZING CACHE FUSION IMPACT ON RAC

The cost of block access and cache coherency is represented by:
• Global Cache Services statistics
• Global Cache Services wait events

The response time for Cache Fusion transfers is determined by:
• Overhead of the physical interconnect components
• IPC protocol
• GCS protocol

The response time is generally not affected by disk I/O factors, except for the occasional log write done when sending a dirty buffer to another instance in a write-read or write-write situation.


ANALYZING CACHE FUSION IMPACT ON RAC

Typical Latencies for RAC Operations

• CR block request time = build time + flush time + send time
• Current block request time = pin time + flush time + send time
• Latencies are available from V$SYSSTAT
• Other latencies may be seen in V$SEGMENT_STATISTICS
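The two sums above can be turned into average per-block service times once the counters are read from V$SYSSTAT. The following is a minimal sketch, not a definitive implementation: the statistic names follow the "gc ..." naming and should be checked against your release, the time counters are assumed to be in centiseconds, and the sample values are invented for illustration.

```python
# Statistic names mirror the "gc ..." counters in V$SYSSTAT (assumed; verify
# against your release). Time counters are assumed to be in centiseconds.
CS_TO_MS = 10

COMPONENTS = {
    # CR block: build time + flush time + send time
    "cr": ("gc cr block build time",
           "gc cr block flush time",
           "gc cr block send time"),
    # Current block: pin time + flush time + send time
    "current": ("gc current block pin time",
                "gc current block flush time",
                "gc current block send time"),
}


def avg_service_time_ms(stats, kind):
    """Average time in ms this instance spent serving one block of `kind`."""
    served = stats[f"gc {kind} blocks served"]
    if served == 0:
        return 0.0
    total_cs = sum(stats[name] for name in COMPONENTS[kind])
    return CS_TO_MS * total_cs / served


# Invented sample counters, as they might be read from V$SYSSTAT.
sample_stats = {
    "gc cr block build time": 20,
    "gc cr block flush time": 30,
    "gc cr block send time": 50,
    "gc cr blocks served": 1000,
    "gc current block pin time": 10,
    "gc current block flush time": 40,
    "gc current block send time": 50,
    "gc current blocks served": 500,
}

if __name__ == "__main__":
    for kind in ("cr", "current"):
        print(kind, avg_service_time_ms(sample_stats, kind), "ms per block")
```

Because the counters are cumulative since instance startup, a practical version would take the difference between two samples rather than the raw values.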


ANALYZING CACHE FUSION IMPACT ON RAC

Wait Events for RAC

• Wait events help to analyze what sessions are waiting for.
• Wait times are attributed to events that reflect the outcome of a request:
  • Placeholders while waiting: wait_time = 0
  • Placeholders after waiting: wait_time != 0
• Global cache waits are summarized in a broader category called the Cluster wait class.
• These wait events are used in ADDM to enable Cache Fusion diagnostics.
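As a rough illustration of the wait_time distinction, the sketch below splits session wait rows in the Cluster wait class into waits still in progress (wait_time = 0) and waits that have completed (wait_time != 0). The row layout (dictionaries with `sid`, `event`, `wait_class` and `wait_time` keys) and the sample rows are assumptions made for this example, not output from a real system.

```python
def split_cluster_waits(rows):
    """Return (in_progress, completed) lists for Cluster wait class rows."""
    cluster = [r for r in rows if r["wait_class"] == "Cluster"]
    in_progress = [r for r in cluster if r["wait_time"] == 0]
    completed = [r for r in cluster if r["wait_time"] != 0]
    return in_progress, completed


# Invented sample rows in the shape of a session wait query result.
sample_rows = [
    {"sid": 101, "event": "gc cr request",
     "wait_class": "Cluster", "wait_time": 0},
    {"sid": 102, "event": "gc cr block 2-way",
     "wait_class": "Cluster", "wait_time": 2},
    {"sid": 103, "event": "db file sequential read",
     "wait_class": "User I/O", "wait_time": 0},
]

if __name__ == "__main__":
    waiting, done = split_cluster_waits(sample_rows)
    print("still waiting:", [r["sid"] for r in waiting])
    print("finished waiting:", [r["sid"] for r in done])
```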


ANALYZING CACHE FUSION IMPACT ON RAC

Wait Events Views


ANALYZING CACHE FUSION IMPACT ON RAC

Global Cache Wait Events: Overview


ANALYZING CACHE FUSION IMPACT ON RAC

2-way Block Request: Example


ANALYZING CACHE FUSION IMPACT ON RAC

3-way Block Request: Example


ANALYZING CACHE FUSION IMPACT ON RAC

2-way Grant: Example


ANALYZING CACHE FUSION IMPACT ON RAC

Global Enqueue Waits: Overview

• Enqueues are synchronous.
• Enqueues are global resources in RAC.
• The most frequent waits are for:
  • TX: row lock waits or ITL waits
  • TM: table manipulation enqueue
  • TA: transaction recovery enqueue
  • SQ: sequence generation enqueue
  • HW: high-water mark enqueue
  • US: undo segment enqueue, used to manage undo segment extensions
• The waits may constitute serious serialization points.


ANALYZING CACHE FUSION IMPACT ON RAC

Session and System Statistics

• Use V$SYSSTAT to characterize the workload.
• Use V$SESSTAT to monitor important sessions.
• V$SEGMENT_STATISTICS includes RAC statistics.
• RAC-relevant statistic groups are:
  • Global Cache Service statistics
  • Global Enqueue Service statistics
  • Statistics for messages sent
• V$ENQUEUE_STATISTICS determines the enqueue with the highest impact.
• V$INSTANCE_CACHE_TRANSFER breaks down GCS statistics into block classes.
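One way to read "the enqueue with the highest impact" is to rank enqueue types by cumulative wait time. The sketch below assumes rows already fetched from V$ENQUEUE_STATISTICS and simplified into dictionaries; the keys `eq_type`, `total_waits` and `cum_wait_time` are illustrative stand-ins for the view's columns, the wait time is assumed to be in milliseconds, and the numbers are invented.

```python
def top_enqueues(rows, limit=3):
    """Rank enqueue types by cumulative wait time, highest first.

    Returns (eq_type, cum_wait_time, average wait per wait) tuples.
    """
    waited = [r for r in rows if r["total_waits"] > 0]
    ranked = sorted(waited, key=lambda r: r["cum_wait_time"], reverse=True)
    return [
        (r["eq_type"], r["cum_wait_time"],
         r["cum_wait_time"] / r["total_waits"])
        for r in ranked[:limit]
    ]


# Invented sample rows; cum_wait_time is assumed to be in milliseconds.
sample_enqueues = [
    {"eq_type": "TX", "total_waits": 1200, "cum_wait_time": 54000},
    {"eq_type": "SQ", "total_waits": 300, "cum_wait_time": 9000},
    {"eq_type": "HW", "total_waits": 50, "cum_wait_time": 12000},
    {"eq_type": "TM", "total_waits": 0, "cum_wait_time": 0},
]

if __name__ == "__main__":
    for eq_type, total_ms, avg_ms in top_enqueues(sample_enqueues):
        print(f"{eq_type}: {total_ms} ms total, {avg_ms:.1f} ms per wait")
```

Ranking by total time and showing the average next to it separates enqueues that are waited on often but briefly from those that are rare but slow.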


PRIVATE INTERCONNECT CONSIDERATIONS

IPC Configuration


PRIVATE INTERCONNECT CONSIDERATIONS

Infrastructure Network Packet Processing


PRIVATE INTERCONNECT CONSIDERATIONS

Network Packet Processing: Layers, Queues and Buffers


PRIVATE INTERCONNECT CONSIDERATIONS

Infrastructure: Private Interconnect

• The network between the nodes of a RAC cluster must be private.
• The NIC must have the same name on all nodes in the RAC cluster.
• Supported links: GbE, IB
• Supported transport protocols: UDP, RDS
• Use multiple or dual-ported NICs for redundancy (HA), load balancing, load spreading and increased bandwidth with NIC bonding/aggregation.
• Large (jumbo) frames are recommended for GbE if the global cache workload requires it.
• Bandwidth requirements depend on several factors (e.g. buffer cache size, number of CPUs per node, access patterns) and cannot be predicted precisely for every application.
• For OLTP, 1 Gb/sec is usually sufficient for performance and scalability.
• DSS/DW systems should be designed with > 1 Gb/sec capacity.
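Although bandwidth needs cannot be predicted precisely, observed traffic can be estimated from block and message counts over an interval and compared with the 1 Gb/sec guideline. This is a back-of-the-envelope sketch under stated assumptions: the function name, the 200-byte average message size and the sample counts are all illustrative, not measured values.

```python
def estimate_interconnect_mbps(blocks, block_size_bytes, messages,
                               elapsed_seconds, avg_message_bytes=200):
    """Estimate interconnect traffic in megabits per second.

    blocks   -- data blocks sent plus received during the interval
    messages -- GCS/GES messages sent plus received during the interval
    avg_message_bytes -- assumed average message size (illustrative default)
    """
    total_bytes = blocks * block_size_bytes + messages * avg_message_bytes
    return total_bytes * 8 / elapsed_seconds / 1_000_000


def link_utilization(mbps, link_capacity_mbps=1000):
    """Fraction of the link capacity used (1000 Mb/sec for a GbE link)."""
    return mbps / link_capacity_mbps


if __name__ == "__main__":
    # Invented one-hour interval: 1M blocks of 8 KB and 2M messages.
    mbps = estimate_interconnect_mbps(1_000_000, 8192, 2_000_000, 3600)
    print(f"estimated traffic: {mbps:.2f} Mb/sec")
    print(f"share of a 1 Gb/sec link: {link_utilization(mbps):.1%}")
```

An hourly average hides bursts, so a figure well below capacity does not rule out short periods of congestion.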


PRIVATE INTERCONNECT CONSIDERATIONS

Infrastructure: IPC configuration and Operating System

• Important settings:
  • Negotiated top bit rate and full duplex mode
  • NIC ring buffers
  • Ethernet flow control settings
  • CPU(s) receiving network interrupts
• Verify your setup:
  • CVU does checking
  • Load testing eliminates potential for problems
  • AWR and ADDM give estimations of link utilization
• Buffer overflows, congested links and flow control can have severe consequences for performance.
• Block access latencies increase when CPUs are busy and run queues are long.
• Immediate LMS scheduling is critical for predictable block access latencies when CPU is > 80% busy.
• Fewer and busier LMS processes may be more efficient; monitor their CPU utilization.
• Caveat: a single LMS process can be good for runtime performance but may impact cluster reconfiguration and instance recovery time.
• The default is good for most requirements; the gcs_server_processes initialization parameter overrides the default.
• Higher priority for LMS is the default. The implementation is platform-specific.


PRIVATE INTERCONNECT CONSIDERATIONS

The Interconnects, VLANs and NIC settings

• The interconnect should be a dedicated, non-routable subnet mapped to a single dedicated, non-shared VLAN.
• If VLANs are ‘trunked’, the interconnect VLAN traffic should not exceed the access switch layer.
• Minimize the impact of Spanning Tree events.
• Monitor the switch(es) for congestion.
• Avoid QoS definitions that may negatively impact interconnect performance.
• NIC settings are driver dependent; the DEFAULTS are GENERALLY SATISFACTORY:
  • Confirm flow control: rx=on, tx=off
  • Confirm full bit rate (1000) for the NICs
  • Confirm full duplex auto-negotiate
  • Ensure NIC names/slots are identical on all nodes
  • Configure interconnect NICs on the fastest PCI bus
  • Ensure compatible switch settings:
    • 802.3ad on NICs = 802.3ad on switch ports
    • MTU=9000 on NICs = MTU=9000 on switch ports

FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING
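The NIC and switch checklist lends itself to an automated comparison. The sketch below is one possible shape for such a check, assuming the settings have already been collected into dictionaries by some other means; the key names (`rx_flow_control`, `speed_mbps`, `lacp_8023ad` and so on) are hypothetical and do not correspond to the output of any particular tool.

```python
# Recommended NIC values taken from the checklist; key names are hypothetical.
RECOMMENDED_NIC = {
    "rx_flow_control": "on",
    "tx_flow_control": "off",
    "speed_mbps": 1000,
    "duplex": "full",
}

# Settings that must be identical on the NIC and on its switch port.
MUST_MATCH_SWITCH = ("mtu", "lacp_8023ad")


def check_interconnect_nic(nic, switch_port):
    """Return a list of human-readable findings; an empty list means OK."""
    problems = []
    for key, wanted in RECOMMENDED_NIC.items():
        found = nic.get(key)
        if found != wanted:
            problems.append(f"{key}: expected {wanted}, found {found}")
    for key in MUST_MATCH_SWITCH:
        if nic.get(key) != switch_port.get(key):
            problems.append(
                f"{key}: NIC has {nic.get(key)}, "
                f"switch port has {switch_port.get(key)}"
            )
    return problems


if __name__ == "__main__":
    # Invented example: jumbo frames on the NIC but not on the switch port.
    nic = {"rx_flow_control": "on", "tx_flow_control": "on",
           "speed_mbps": 1000, "duplex": "full",
           "mtu": 9000, "lacp_8023ad": True}
    port = {"mtu": 1500, "lacp_8023ad": True}
    for finding in check_interconnect_nic(nic, port):
        print(finding)
```

Running the same check on every node also helps confirm that the settings are identical across the cluster.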


PRIVATE INTERCONNECT CONSIDERATIONS

Measurement | SMP Bus | Memory channel | Myrinet | SUN SCI | Gigabit Ethernet (GbE) | Infiniband (IB)
Latency (microseconds) | 0.5 | 3 | 7-9 | 10 | 100 | <10
CPU overhead (microseconds) | <1 | <1 | <1 | | ~100 |
Messages/sec (million) | >10 | >2 | | | <0.1 |
Bandwidth (MB/sec) | >500 | >100 | ~250 | ~70 | ~50 | 3 Gbps with ability to aggregate


AGGREGATION

• Cisco Etherchannel based 802.3ad
• AIX Etherchannel
• HPUX Auto Port Aggregation
• SUN Trunking, IPMP, GLD
• Linux Bonding (only certain modes)
• Windows NIC teaming

Aggregation Methods
• Load balance/failover/load spreading: spread on sends, serialize on receives
• Active/Standby

Oracle Interconnect Requirement
• Both send-side and receive-side load balancing
• NIC and switch port failure detection


COMMON PROBLEMS AND SYMPTOMS

Wait events worth investigation

• gc [current/cr] block lost: This event shows block losses during transfers. High values indicate IPC or downstream network problems. The ‘request retry’ event is likely to be seen as well.

• global cache blocks corrupt: This statistic shows whether any blocks were corrupted during transfers. If high values are returned for this statistic, there is probably an IPC, network or hardware problem.

• global cache open s and global cache open x: The initial access of a particular data block by an instance generates these events. The wait should be short, and its completion is most likely followed by a read from disk. The wait occurs because the requested blocks are not cached in any instance in the cluster database. Pre-load heavily used tables into the buffer caches.

• global cache null to s and global cache null to x: These events are generated by inter-instance block pinging across the network, which is when two instances exchange the same block back and forth. Reduce the number of rows per block to eliminate the need for block swapping between two instances in the RAC cluster.

• global cache cr request: This event is generated when an instance has requested a consistent read data block and the block to be transferred has not yet arrived at the requesting instance. This is a placeholder event; look for other gc events.

• gc buffer busy: This event can be associated with disk I/O contention, for example slow disk I/O caused by a rogue query. Slow concurrent scans can cause buffer cache contention. Note, however, that there can be multiple symptoms for the same cause. It can be seen together with the ‘db file scattered read’ event. Global cache access and serialization contribute to this event. Serialization is likely to be due to log flush time on another node or immediate block transfers.


COMMON PROBLEMS AND SYMPTOMS

Wait events worth investigation

• congested: Events that contain ‘congested’ suggest CPU or LMS saturation, long-running queries, swapping or network configuration issues. Maintain a global view and remember that symptom and cause can be on different instances.

• busy: Events that contain ‘busy’ indicate contention. Investigate by drilling down into either the SQL with the highest cluster wait time or the segment statistics with the highest block transfers. Also look at objects with the highest number of block transfers and global serialization.

• gc [current/cr] [2/3]-way: Increase the private interconnect bandwidth and decrease the private interconnect latency.

• gc [current/cr] grant 2-way: Increase the private interconnect bandwidth and decrease the private interconnect latency.

• gc [current/cr] [block/grant] congested: The block or grant was eventually received, but with a delay caused by intensive CPU consumption, lack of memory, LMS overload due to too much work in the queues, paging or swapping. This is worth investigating as it provides room for improvement. We will look at it later.

• gc [current/cr] block busy: The block was received but was not sent immediately because of high concurrency or contention. A block can be busy for a variety of reasons; it simply means that it could not be sent immediately for Oracle-related reasons.

• gc current grant busy: The grant was received, but with a delay due to many shared block images or load.

• gc [current/cr] [failure/retry]: Failure means that the block image could not be received; retry means that the problem recovered and the block image was ultimately received after retrying. Investigate IPC or downstream network problems.
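The keyword-based reading of gc event names described on these slides can be captured in a small lookup. The sketch below is only a first-pass triage aid based on the hints above, not an Oracle-provided classification; the function name and the short category labels are made up for this example.

```python
def triage_gc_event(event_name):
    """Map a gc wait event name to a first place to look.

    Keywords are checked in order, so 'lost', 'failure' and 'retry'
    take precedence over 'congested', 'busy' and the 2-way/3-way events.
    """
    name = event_name.lower()
    if not name.startswith("gc "):
        return "not-a-gc-event"
    if any(word in name for word in ("lost", "failure", "retry")):
        return "ipc-or-network-problem"
    if "congested" in name:
        return "cpu-lms-memory-saturation"
    if "busy" in name:
        return "contention"
    if "2-way" in name or "3-way" in name:
        return "interconnect-bandwidth-latency"
    return "no-specific-hint"


if __name__ == "__main__":
    for event in ("gc cr block lost", "gc current block congested",
                  "gc buffer busy", "gc cr block 3-way",
                  "gc current grant 2-way"):
        print(f"{event}: {triage_gc_event(event)}")
```

Since symptom and cause can be on different instances, the result is best treated as a starting point for a cluster-wide look rather than a diagnosis.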


DIAGNOSTICS AND PROBLEM DETERMINATION

• Tune for a single instance first.
• Tune for RAC:
  • Instance recovery
  • Interconnect traffic
  • Points of serialization can be exacerbated
• RAC reactive tuning tools:
  • Specific wait events
  • System and enqueue statistics
  • Enterprise Manager performance pages
  • AWR and ASH reports
• RAC proactive tuning tools:
  • AWR snapshots
  • ADDM reports


DIAGNOSTICS AND PROBLEM DETERMINATION

Most common RAC tuning tips

• Application tuning is often the most beneficial.
• Resize and tune the buffer cache.
• Reduce long full-table scans in OLTP systems.
• Use Automatic Segment Space Management.
• Increase sequence caches.
• Use partitioning to reduce inter-instance traffic.
• Avoid unnecessary parsing.
• Minimize locking usage.
• Remove unselective indexes.
• Configure the interconnect properly.


DIAGNOSTICS AND PROBLEM DETERMINATION


ORACLE CLUSTERWARE AND PRIVATE NETWORK CONSIDERATIONS - PRACTICAL PERFORMANCE MANAGEMENT FOR ORACLE RAC

Questions & Answers