Clusterware Network


Copyright 2008, Oracle. All rights reserved.

    Oracle Clusterware and Private Network Considerations

    Much of this presentation is attributed to Michael Zoll and work done by the RAC Performance Development group


The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.


Agenda

Architectural Overview

RAC and Cache Fusion Performance

Infrastructure

Common Problems and Resolution

Aggregation and VLANs


    Oracle Clusterware

[Architecture diagram: each cluster node runs the Oracle Clusterware daemons EVMD, CRSD, OPROCD, ONS, and CSSD on top of the OS, with CSSD running in real-time priority. The nodes expose VIPs (VIP1 ... VIPn) on the public network, communicate over a private high-speed cluster network through an L2/L3 switch, and share storage holding the OCR and voting disks (raw devices).]


    Under the Covers

[Architecture diagram: instances 1, 2, ..., n on nodes 1, 2, ..., n each have an SGA containing a buffer cache, library cache, dictionary cache, log buffer, and Global Resource Directory, plus background processes VKTM, LGWR, DBW0, SMON, PMON, LMON, LMD0, DIAG, and LMS0, with LMS0 running in real-time priority. Each instance has its own redo log files; data files and control files are shared. The instances communicate over the private high-speed cluster network through an L2/L3 switch.]


    Global Cache Service (GCS)

Manages coherent access to data in the buffer caches of all instances in the cluster

Minimizes access time to data which is not in the local cache
  access to data in the global cache is faster than disk access

Implements fast direct memory access over high-speed interconnects
  for all data blocks and types

Uses an efficient and scalable messaging protocol
  never more than 3 hops

New optimizations for read-mostly applications


    Cache Hierarchy: Data in Remote Cache

Local cache miss -> data block requested -> remote cache hit -> data block returned


    Cache Hierarchy: Data On Disk

Local cache miss -> data block requested -> remote cache miss -> grant returned -> disk read


    Cache Hierarchy: Read Mostly

Local cache miss -> no message required -> disk read


11.1: CPU optimizations for read-intensive operations

Read-only access
  no messages, direct reads

Read-mostly access
  message reductions, latency improvements

Significant gains
  reductions of 50-70% measured


    Performance of Cache Fusion

Message: ~200 bytes; block: e.g. 8K

[Timing diagram: the requester initiates a send (200 bytes at 1 Gb/sec) and waits; LMS on the holding instance receives the request, processes the block, and sends it back (8192 bytes at 1 Gb/sec); the requester receives the block.]

Total access time: e.g. ~360 microseconds (UDP over GbE)

Network propagation delay (wire time) is a minor factor for roundtrip time (approx. 6%, vs. 52% in the OS and network stack)
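As a back-of-the-envelope check (illustrative arithmetic, not from the slide), the time needed just to serialize the two transfers onto a 1 Gb/sec link is only a few tens of microseconds, so most of the ~360 microsecond roundtrip is spent in the OS and network stack and in LMS processing rather than on the wire:

$ echo "scale=1; (200*8)/1000" | bc      # ~200-byte request message, in microseconds at 1 Gb/sec
1.6
$ echo "scale=1; (8192*8)/1000" | bc     # 8K block, in microseconds at 1 Gb/sec
65.5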


Fundamentals: Minimum Latency (*), UDP/GbE and RDS/IB

RT (ms)      Block size:    2K      4K      8K      16K
UDP/GE                     0.30    0.31    0.36    0.46
RDS/IB                     0.12    0.13    0.16    0.20

(*) roundtrip; blocks are not busy, i.e. no log flush, no serialization (buffer busy)

AWR and Statspack reports show averages as if they were normally distributed; the session wait history, included in Statspack in 10.2 and AWR in 11g, shows the actual quantiles.

The minimum values in this table are the optimal values for 2-way and 3-way block transfers, but can be assumed to be the expected values (i.e. 10 ms for a 2-way block would be very high).


Infrastructure: Network Packet Processing

[Diagram: on the sending node, a process (FG/LMS*) writes into the socket layer (tx buffers), which passes data through TCP/UDP and the IP layer to the interface layer (TX); packets cross the L2/L3 switch; on the receiving node they travel up through the interface layer (RX), IP layer, TCP/UDP, and socket layer (rx buffers) to the receiving process (FG/LMS*).]


Infrastructure: Interconnect Bandwidth

Bandwidth requirements depend on several factors (e.g. buffer cache size, # of CPUs per node, access patterns) and cannot be predicted precisely for every application

Typical utilization approx. 10-30% in OLTP

10000-12000 8K blocks per sec to saturate 1 x Gb Ethernet (75-80% of theoretical bandwidth)

Generally, 1 Gb/sec is sufficient for performance and scalability in OLTP

DSS/DW systems should be designed with > 1 Gb/sec capacity

A sizing approach with rules of thumb is described in Project MegaGrid: Capacity Planning for Large Commodity Clusters (http://otn.oracle.com/rac)
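As a quick sanity check of that saturation figure (illustrative arithmetic, payload only, ignoring UDP/IP/Ethernet overhead), 12000 8K blocks per second corresponds to roughly 786 Mbit/sec, i.e. close to 80% of a 1 Gb/sec link:

$ echo "(12000*8192*8)/1000000" | bc     # Mbit/sec carried by 12000 x 8K blocks per second
786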


Infrastructure: Private Interconnect

The network between the nodes of a RAC cluster MUST be private

Supported links: GbE, IB (IPoIB: 10.2)

Supported transport protocols: UDP, RDS (10.2.0.3)

Use multiple or dual-ported NICs for redundancy, and increase bandwidth with NIC bonding

Large (Jumbo) Frames for GbE are recommended if the global cache workload requires it
  (global cache block shipping versus small lock message passing)



Network Packet Processing: Layers, Queues and Buffers

[Diagram: user-space processes (FG/LMS*) sit above the kernel socket layer (tx/rx socket buffers and socket queues); beneath that are TCP/UDP, the IP layer (TX IP queues, RX IP input queue), and the interface layer (RX queues and buffers, hardware interrupts on receive, software interrupts feeding the IP input queue). Packets cross an L2/L3 switch with ingress and egress buffers; backplane pressure and CPU availability affect every layer. The receiving process picks the data up from the rx socket buffers via recv().]


    Infrastructure: IPC configuration

Important settings:
  Negotiated top bit rate and full duplex mode
  NIC ring buffers
  Ethernet flow control settings
  CPU(s) receiving network interrupts

Verify your setup:
  CVU does checking
  Load testing eliminates potential for problems
  AWR and ADDM give estimations of link utilization

Buffer overflows, congested links and flow control can have severe consequences for performance
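On Linux these settings can be inspected with standard tools; a minimal sketch, assuming a hypothetical interconnect interface named eth1:

$ ethtool eth1                   # negotiated speed and duplex mode
$ ethtool -g eth1                # NIC ring buffer sizes (current vs. hardware maximum)
$ ethtool -a eth1                # Ethernet flow control (pause frame) settings
$ grep eth1 /proc/interrupts     # which CPU(s) are servicing the NIC interrupts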


    Infrastructure: Operating System

Block access latencies increase when CPU(s) are busy and run queues are long

Immediate LMS scheduling is critical for predictable block access latencies when CPUs are > 80% busy

Fewer and busier LMS processes may be more efficient
  monitor their CPU utilization
  Caveat: 1 LMS can be good for runtime performance but may impact cluster reconfiguration and instance recovery time; the default is good for most requirements

Higher priority for LMS is the default
  the implementation is platform-specific
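On Linux, the scheduling class and priority of the LMS processes can be verified as below (a sketch, assuming the default background process naming ora_lms*):

$ ps -eo pid,cls,rtprio,ni,comm | grep -i lms
# CLS "RR" with a non-zero RTPRIO indicates LMS is running in the real-time round-robin class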


    Common Problems and Symptoms

    Lost Blocks: Interconnect or Switch Problems

    System load and scheduling

    Contention

    Unexpectedly high global cache latencies


Misconfigured or Faulty Interconnect Can Cause:

Dropped packets/fragments

Buffer overflows

Packet reassembly failures or timeouts

Ethernet flow control kicks in

TX/RX errors

Lost blocks at the RDBMS level, responsible for 64% of escalations


    Lost Blocks: NIC Receive Errors

    Db_block_size = 8K

$ ifconfig -a

    eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04

    inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0

    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

    RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95

    TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0


    Lost Blocks: IP Packet Reassembly Failures

$ netstat -s

    Ip:

    84884742 total packets received

    1201 fragments dropped after timeout

    3384 packet reassembles failed
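On Linux, persistent reassembly failures often mean the kernel's IP fragment reassembly buffers are too small for the interconnect traffic; the relevant thresholds can be inspected, and raised if necessary, with sysctl (a sketch; the values shown are only examples, not recommendations from the slide):

$ sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_low_thresh
$ sysctl -w net.ipv4.ipfrag_high_thresh=16777216    # example value; persist in /etc/sysctl.conf
$ sysctl -w net.ipv4.ipfrag_low_thresh=15728640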


Finding a Problem with the Interconnect or IPC

Top 5 Timed Events                                  Avg wait   % Total
Event                    Waits    Time(s)     (ms)  Call Time  Wait Class
----------------------- -------- --------- ------- ---------- ----------
log file sync            286,038    49,872     174       41.7  Commit
gc buffer busy           177,315    29,021     164       24.3  Cluster
gc cr block busy         110,348     5,703      52        4.8  Cluster
gc cr block lost           4,272     4,953    1159        4.1  Cluster   <-- should never be here
cr request retry           6,316     4,668     739        3.9  Other


    Global Cache Lost block handling

Detection time in 11g reduced to 500 ms (around 5 secs in 10g)
  can be lowered if necessary
  robust (no false positives)
  no extra overhead

The 'cr request retry' event is related to lost blocks
  it is highly likely to be seen when 'gc cr block lost' shows up


    Interconnect Statistics

    Automatic Workload Repository (AWR )

Target      Avg Latency   Stddev     Avg Latency   Stddev
Instance    500B msg      500B msg   8K msg        8K msg
---------------------------------------------------------------------
1            .79           .65        1.04          1.06
2            .75           .57         .95           .78
3            .55           .59         .53           .59
4           1.59          3.16        1.46          1.82
---------------------------------------------------------------------

    Latency probes for different message sizes

    Exact throughput measurements ( not shown)

    Send and receive errors, dropped packets ( not shown )


    Blocks Lost: Solution

    Fix interconnect NICs and switches

    Tune IPC buffer sizes


    CPU Saturation or Long Run Queues

Top 5 Timed Events                                       Avg wait   % Total
Event                        Waits      Time(s)    (ms)  Call Time  Wait Class
--------------------------- ----------- --------- ------ ---------- ----------
db file sequential read       1,312,840    21,590     16       21.8  User I/O
gc current block congested      275,004    21,054     77       21.3  Cluster
gc cr grant congested           177,044    13,495     76       13.6  Cluster
gc current block 2-way        1,192,113     9,931      8       10.0  Cluster
gc cr block congested            85,975     8,917    104        9.0  Cluster

Congested: LMS could not dequeue messages fast enough
Cause: long run queue, CPU starvation
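To confirm the diagnosis at the OS level, sample the run queue and CPU utilization while the waits are occurring; a sketch using vmstat (available on Linux and most UNIX platforms):

$ vmstat 5
# an "r" (runnable) column persistently well above the CPU count, together with high
# us+sy utilization, is the CPU starvation behind the "congested" waits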


    High CPU Load: Solution

    Run LMS at higher priority (default)

    Start more LMS processes

Never use more LMS processes than CPUs

Reduce the number of user processes

    Find cause of high CPU consumption


    Contention

Event                     Waits     Time (s)  AVG (ms)  % Call Time
------------------------ ---------- --------- --------- -----------
gc cr block 2-way          317,062     5,767        18        19.0
gc current block 2-way     201,663     4,063        20        13.4
gc buffer busy             111,372     3,970        36        13.1
CPU time                                2,938                  9.7
gc cr block busy            40,688     1,670        41         5.5
-------------------------------------------------------

Global Contention on Data Serialization

It is very likely that CR BLOCK BUSY and GC BUFFER BUSY are related


    Contention: Solution

    Identify hot blocks in application

    Reduce concurrency on hot blocks


    High Latencies

Event                     Waits     Time (s)  AVG (ms)  % Call Time
------------------------ ---------- --------- --------- -----------
gc cr block 2-way          317,062     5,767        18        19.0
gc current block 2-way     201,663     4,063        20        13.4
gc buffer busy             111,372     3,970        36        13.1
CPU time                                2,938                  9.7
gc cr block busy            40,688     1,670        41         5.5
-------------------------------------------------------

    Tackle latency first, then tackle busy events

    Expected: To see 2-way, 3-way

    Unexpected: To see > 1 ms (AVG ms should be around 1 ms)


High Latencies: Solution

Check network configuration
  private
  running at the expected bit rate

Find cause of high CPU consumption
  runaway or spinning processes


    Health Check

Look for:
  Unexpected events:             gc cr block lost             1159 ms
  Unexpected hints:
    Contention and serialization: gc cr/current block busy      52 ms
    Load and scheduling:          gc current block congested    14 ms
    Unexpectedly high avg:        gc cr/current block 2-way     36 ms


    Gigabit Ethernet Definition

Max bandwidth 1000 Mbit/sec = 125 MB per sec; excluding header and pause frames, ~118 MB per sec

Equates to 85000 Clusterware/RAC messages or 14000 8K blocks

RAC workload has a mix of short messages of 256 bytes and long messages of db_block_size

For a real-life workload only 60-70% of the bandwidth can be sustained

For a RAC-type workload 40 MB per sec per interface is an optimal load

For additional bandwidth, more interfaces can be aggregated


Aggregation Active/Standby (single switch)

[Diagram: two NICs, ce2 and ce4, cabled to a single 1U switch and grouped (groupname private); the interconnect address is plumbed on the logical interface ce4:1.]

$ ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843 mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843 mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255


Aggregation Active/Active (single switch)

[Diagram: two NICs, ce2 and ce4, cabled to a single 1U switch and grouped (groupname private); the interconnect address is plumbed on the logical interface ce4:1.]

$ ifconfig -a
ce2: flags=69040843 mtu 1500 index 3
     inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4: flags=9040843 mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843 mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255


Aggregation Active/Standby (switch redundancy)

[Diagram: the NICs (ce2/ce4 and ce8/ce10) are cabled to two separate 1U switches so that a single switch failure does not take down the interconnect; the NICs are grouped (groupname private) and the interconnect address is plumbed on the logical interface ce4:1.]

$ ifconfig -a
ce10: flags=69040843 mtu 1500 index 3
      inet 192.168.83.36 netmask ffffff00 broadcast 192.168.83.255
      groupname private
ce4: flags=9040843 mtu 1500 index 8
     inet 192.168.83.35 netmask ffffff00 broadcast 192.168.83.255
     groupname private
ce4:1: flags=1000843 mtu 1500 index 8
     inet 192.168.83.37 netmask ffffff00 broadcast 192.168.83.255


    Aggregation Solutions

Cisco EtherChannel, based on 802.3ad

AIX EtherChannel

HP-UX Auto Port Aggregation

Sun Trunking, IPMP, GLD

Linux Bonding (only certain modes; see the sketch below)

Windows NIC teaming
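For Linux, an active/standby bond is one of the supported modes; the following is a minimal sketch for a RHEL-style system, assuming hypothetical interface names eth1/eth2 and reusing an example address from the earlier aggregation slides:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.83.35
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1   (ifcfg-eth2 is analogous)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes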


    Aggregation Methods

Load balance / failover / load spreading
  spread on sends, serialize on receives

Active/Standby

Oracle interconnect requirement:
  both send- and receive-side load balancing
  NIC and switch port failure detection


General Interconnect Recommendations

For OLTP workloads
  normally 1 Gbit Ethernet with redundancy (active/standby or load-balance) is sufficient

For DW workloads
  multiple GigE aggregated, 10 GigE, or InfiniBand


Oracle RAC Cluster Interconnect Network Selection

Oracle Clusterware
  IP address associated with the private hostname (provided during the install interview)

Oracle RAC Database
  private network specified during the install interview
  or the IP address provided via the cluster_interconnects parameter
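The interfaces registered with Oracle Clusterware can be listed with the oifcfg utility; the output below is purely illustrative (hypothetical interface names and subnets):

$ oifcfg getif
eth0  10.1.1.0      global  public
eth1  192.168.83.0  global  cluster_interconnect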


    Jumbo Frames

    Non-IEEE standard

    Useful for NAS/iSCSI storage

    Network device inter-operability issues

    Configure with care and test rigorously

Excerpt from alert.log:

  Maximum Tranmission Unit (mtu) of the ether adapter is different on the node running
  instance 4, and this node. Ether adapters connecting the cluster nodes must be
  configured with identical mtu on all the nodes, for Oracle. Please ensure the mtu
  attribute of the ether adapter on all nodes [and switch ports] are identical, before
  running Oracle.
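A minimal sketch of setting and verifying jumbo frames on a Linux node, assuming a hypothetical interconnect device eth1 and a peer address from the example subnet shown on the aggregation slides; the MTU must match on every node and on the switch ports:

$ ifconfig eth1 mtu 9000                 # or set MTU=9000 in the interface config file
$ ping -M do -s 8972 192.168.83.36       # 8972 = 9000 minus 28 bytes of IP/ICMP headers;
                                         # success with the don't-fragment flag set proves
                                         # the 9000-byte path end to end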


UDP Socket Buffer (rx)

Default settings are adequate for the majority of customers

May need to increase the allocated buffer size when:
  the MTU size increases
  netstat reports fragmentation and/or reassembly errors
  ifconfig reports dropped packets or overflow
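On Linux the receive-buffer limits are kernel parameters; a sketch of inspecting and raising them (the 4 MB value is only an example, not a recommendation from the slide):

$ sysctl net.core.rmem_default net.core.rmem_max
$ sysctl -w net.core.rmem_max=4194304
$ sysctl -w net.core.rmem_default=4194304     # persist changes in /etc/sysctl.conf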


    Cluster Interconnect NIC settings

NIC driver dependent; DEFAULTS ARE GENERALLY SATISFACTORY

Changes can occur between OS versions
  Linux 2.4 => 2.6 kernels: flow control on e1000 drivers
  NAPI interrupt coalescence in 2.6

Confirm flow control: rx=on, tx=off

Confirm full bit rate (1000) for the NICs

Confirm full duplex auto-negotiate

Ensure NIC names/slots are identical on all nodes

Configure interconnect NICs on the fastest PCI bus

Ensure compatible switch settings
  802.3ad on NICs = 802.3ad on switch ports
  MTU=9000 on NICs = MTU=9000 on switch ports

FAILURE TO CONFIGURE THE NICS AND SWITCHES CORRECTLY WILL RESULT IN SEVERE PERFORMANCE DEGRADATION AND NODE FENCING
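On Linux the flow control and link settings can be applied and re-checked with ethtool; a sketch assuming a hypothetical interface eth1 (driver support for these options varies):

$ ethtool -A eth1 rx on tx off                  # flow control: accept pause frames, do not send them
$ ethtool eth1 | grep -E "Speed|Duplex|Auto"    # confirm 1000Mb/s, full duplex, auto-negotiation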


    The Interconnect and VLANs

The interconnect should be a dedicated, non-routable subnet mapped to a single dedicated, non-shared VLAN

If VLANs are trunked, the interconnect VLAN traffic should not exceed the access switch layer

Minimize the impact of Spanning Tree events

Monitor the switch(es) for congestion

Avoid QoS definitions that may negatively impact interconnect performance


Q & A: Questions and Answers