Network Bandwidth Implications of Oracle Data Guard

Introduction

Oracle Data Guard is Oracle's data protection and disaster recovery solution. One of the frequent questions that customers ask the Data Guard team is how much network bandwidth is required by Data Guard. A variation of the same question is: Can the network link between the production (or primary site) and disaster recovery (DR or secondary site) data center support a Data Guard configuration?

At a high level, the answer is simple enough. It depends on how busy the production database is. Let's look into it in a bit more detail.

It's the Redo

What is the basis of Data Guard's operation? Well, Data Guard sends the redo data generated by the primary database to one or more secondary, or standby databases. That's how Data Guard keeps the standby databases transactionally consistent with the primary database. The more redo data the primary database generates, the more redo data Data Guard needs to transmit to the standby database. In other words, the faster the primary database generates redo, the faster Data Guard needs to send the redo to the standby database, otherwise either the standby database may fall behind, or processing on the primary database may slow down (the exact behavior depends on the Data Guard protection mode chosen - more on it later). This means that the available network between the primary and standby databases must be capable of supporting this redo generation rate, or more precisely - the peak redo generation rate.

Since the amount of redo generated by a database is proportional to the transactional, or the write activity in the database, this implies that for a very busy OLTP (on-line transaction processing) system (e.g. the leading e-commerce websites), the network bandwidth required by Data Guard will be higher than that required by a non-OLTP system, or a system which supports primarily read-intensive transactions (e.g. systems for the technical support knowledge bank of a hi-tech company, or systems that allow you to view your present/previous credit card statements, your bank account balances, etc.).

What are the typical redo generation rates of Oracle production databases out there? The following graph, which summarizes the responses to the question "What is your peak redo generation rate?" from approximately 100 Oracle customers interested in Data Guard and attending OracleWorld San Francisco 2003, provides some insights:

The responses show that more than 70% of these Oracle customers reported a peak redo rate of less than 500 KB/sec.

Measuring the Peak Redo Rate

How does one measure the peak redo generation rate for a database? Use the Oracle Statspack utility for an accurate measurement of the redo rate.

Based on your business you should have a good idea as to what your peak periods of normal business activity are. For example, you may be running an online store which historically sees the peak activity for 4 hours every Monday between 10:00 am - 2:00 pm. Or, you may be running a merchandising database which batch-loads a new catalog every Thursday for 2 hours between 1 am - 3 am. Note that we say "normal" business activity - this means that in certain days of the year you may witness much heavier business volume than usual, e.g. the 2-3 days before Mother's Day or Valentine's Day for an online florist business. Just for those days, perhaps you may allocate higher bandwidth than usual, and you may not consider those as "normal" business activity. However, if such periodic surges of traffic are regularly expected as part of your business operations, you must consider them in your redo rate calculation.

During the peak duration of your business, run a Statspack snapshot at periodic intervals. For example, you may run it three times during your peak hours, each time for a five-minute duration. The Statspack snapshot report will include a "Redo size" line under the "Load Profile" section near the beginning of the report. This line includes the "Per Second" and "Per Transaction" measurements for the redo size in bytes during the snapshot interval. Make a note of the "Per Second" value. Take the highest "Redo size" "Per Second" value of these three snapshots, and that is your peak redo generation rate. For example, this highest "Per Second" value may be 394,253 bytes.
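The selection step described above can be sketched in a few lines of Python. This is only an illustration: the 394,253 bytes/sec reading is the example value from the text, while the other two readings and the function name `peak_redo_rate` are invented for the sketch.

```python
# Hypothetical "Redo size" "Per Second" readings (bytes/sec) from three
# Statspack snapshots taken during peak hours. Only 394,253 comes from the
# example in the text; the other two values are invented for illustration.
snapshot_redo_bytes_per_sec = [312_480, 394_253, 287_915]


def peak_redo_rate(readings):
    """Return the highest per-second redo rate among the snapshot readings."""
    if not readings:
        raise ValueError("at least one snapshot reading is required")
    return max(readings)


peak = peak_redo_rate(snapshot_redo_bytes_per_sec)
print(f"Peak redo rate: {peak:,} bytes/sec ({peak / 1024:.0f} KB/sec)")
```

Taking the maximum rather than the average is deliberate, since the network has to keep up with the busiest interval, not the typical one.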

Note that if your primary database is a RAC database, you must run the Statspack snapshot on every RAC instance. Then, for each Statspack snapshot, sum the "Redo Size Per Second" value of each instance, to obtain the net peak redo generation rate for the primary database. Remember that for a RAC primary database, each node generates its own redo and independently sends that redo to the standby database - hence the reason to sum up the redo rates for each RAC node, to obtain the net peak redo rate for the database.
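For a RAC primary, the same idea might be sketched as follows: sum across instances within each snapshot interval, then take the largest total. All readings here are hypothetical values for a two-node cluster, and `rac_peak_redo_rate` is a name made up for this sketch.

```python
# Hypothetical per-instance readings (bytes/sec) for a two-node RAC primary.
# Each inner list holds one value per instance for the same snapshot interval.
rac_snapshots = [
    [150_000, 120_000],
    [210_000, 184_253],
    [140_000, 130_000],
]


def rac_peak_redo_rate(snapshots):
    """Sum the instances within each snapshot, then return the largest total."""
    if not snapshots:
        raise ValueError("at least one snapshot is required")
    totals = [sum(per_instance) for per_instance in snapshots]
    return max(totals)


print(f"Net peak redo rate: {rac_peak_redo_rate(rac_snapshots):,} bytes/sec")
```

Summing within a snapshot before taking the maximum matters: adding each instance's individual peak could overstate the requirement if the instances peak at different times.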

Redo Generation Rate and the Required Network Bandwidth

The paper titled "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices" available at http://otn.oracle.com/deploy/availability/htdocs/maa.htm, is part of Oracle Maximum Availability Architecture (MAA) series of white papers, and provides a useful framework to show the correlation between the peak redo rate and the required bandwidth (ref. Appendix F: Network Throughput and Peak Redo Rates). This article will not go into the details of the formula calculation since it is already explained in the paper. The formula used in the paper (assuming a conservative TCP/IP network overhead of 30%) is:

Required bandwidth = ((Redo rate bytes per sec. / 0.7) * 8) / 1,000,000 = bandwidth in Mbps

Thus, our example peak rate of 394,253 bytes/sec (about 385 KB/sec) would require an available network bandwidth of at least

((394253 / 0.7) * 8) / 1,000,000 = 4.5 Mbps.
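The formula translates directly into a small Python function. The function and parameter names are chosen for this sketch; the 0.7 default reflects the 30% TCP/IP overhead assumed in the paper.

```python
def required_bandwidth_mbps(redo_bytes_per_sec, network_efficiency=0.7):
    """Bandwidth in Mbps needed to carry the given redo rate.

    network_efficiency is the usable fraction of the link after protocol
    overhead; 0.7 corresponds to the 30% overhead assumed in the formula.
    """
    if not 0.0 < network_efficiency <= 1.0:
        raise ValueError("network_efficiency must be in (0, 1]")
    bits_per_sec = (redo_bytes_per_sec / network_efficiency) * 8
    return bits_per_sec / 1_000_000


print(f"{required_bandwidth_mbps(394_253):.1f} Mbps")
```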

For this Data Guard configuration, a standard T1 line primary-standby connection providing up to 1.544 Mbps will not be adequate. However, a T3 connection (typically providing up to 44.736 Mbps) may be more than adequate, provided of course this connection is not heavily shared by other applications that may reduce the effective bandwidth for the primary-standby connection. This means that while the peak redo generation rate is a good indication of your Data Guard-related network requirements, make sure that while specifying your network requirements with your network service provider, you also consider other applications and their Service Level Agreements (SLAs) that may be sharing this network. Remember - the formula above indicates the network bandwidth that should be available to Data Guard, it does not indicate what the entire network bandwidth should be between your primary and DR data centers.
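A rough way to reason about a shared link is to compare the Data Guard requirement against only the portion of the link that other applications leave free. The sketch below uses the T1 and T3 capacities quoted above; the `fraction_available` parameter and the sample shares are hypothetical planning inputs, not measured values.

```python
# Link capacities quoted in the text, in Mbps.
T1_MBPS = 1.544
T3_MBPS = 44.736


def link_is_adequate(required_mbps, link_mbps, fraction_available=1.0):
    """Return True if the share of the link left for Data Guard covers the need.

    fraction_available is a hypothetical planning input: the portion of the
    link not consumed by other applications sharing it.
    """
    if not 0.0 < fraction_available <= 1.0:
        raise ValueError("fraction_available must be in (0, 1]")
    return link_mbps * fraction_available >= required_mbps


required = 4.5  # Mbps, from the worked example above
print(link_is_adequate(required, T1_MBPS))       # dedicated T1
print(link_is_adequate(required, T3_MBPS, 0.5))  # T3, half used by others
print(link_is_adequate(required, T3_MBPS, 0.1))  # T3, only 10% left over
```

Under these assumed shares, the T1 link falls short, a half-available T3 is sufficient, and a T3 with only 10% free (about 4.47 Mbps) narrowly misses the 4.5 Mbps requirement.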

If this network link may be shared with other critical apps, consider configuring a higher bandwidth network e.g. dark fibre, OC1, or OC3, and/or using Quality of Service (QoS) to prioritize network traffic or to allocate dedicated bandwidth to a particular class of traffic, to prevent bursty traffic adversely affecting your latency-sensitive traffic (such as Data Guard redo traffic).

Data Guard Protection Modes and the Network

Data Guard can be configured in one of three protection modes - Maximum Protection, Maximum Availability or Maximum Performance. These protection modes essentially differ in their recommended redo data transport settings, in the behavior of the primary when the last standby in the chosen protection mode is unavailable, and in their capabilities for zero data loss in the event of a disaster at the primary site.

For Maximum Protection and Maximum Availability, the redo data transport setting requires the LGWR SYNC AFFIRM attributes in the log_archive_dest_n entry for the particular standby. For the Maximum Performance mode, the redo transport is set to the LGWR ASYNC, or alternatively, ARCH attributes.

Synchronous transport, as implied by LGWR SYNC AFFIRM attributes, means that primary database transactions are not committed till they are also available on disk on the standby. This implies that a possible impact on the production transactions is correlated to the latency of the network link between the primary and the standby. Since the latency or round trip time for a network is usually correlated to the length of the network, or the physical distance between the two end points (in this case the primary and standby), Maximum Protection and Maximum Availability modes are not recommended for Data Guard deployments over a Wide Area Network (WAN). Note that this recommendation is driven by the laws of physics (speed of light limitation) - the greater the distance of a network, the longer it will take for data packets to traverse the network, and hence the longer it will take for primary database transactions to commit.
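To get a feel for the speed-of-light limitation, the following Python sketch estimates a lower bound on network round-trip time from distance alone. It assumes light travels through optical fibre at roughly 200,000 km/s (about two-thirds of its speed in a vacuum), ignores routing, switching and equipment delays, and uses hypothetical distances; real latencies will be higher.

```python
# Assumption: light in optical fibre travels at roughly 200,000 km/s,
# about two-thirds of its speed in a vacuum.
FIBRE_SPEED_KM_PER_SEC = 200_000.0


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time over a fibre path of the given length."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    return (2.0 * distance_km / FIBRE_SPEED_KM_PER_SEC) * 1000.0


for km in (10, 100, 4000):  # hypothetical LAN/MAN/WAN-scale distances
    print(f"{km:>5} km: at least {min_round_trip_ms(km):.2f} ms per round trip")
```

With synchronous transport, every commit on the primary waits for at least one such round trip in addition to the disk write on the standby, which is why distance matters far more for Maximum Protection and Maximum Availability than for Maximum Performance.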

For WAN deployments of Data Guard, the Maximum Performance protection mode is recommended. All three protection modes can however be used for Local Area Network (LAN) or Metropolitan Area Network (MAN) deployments of Data Guard. As demonstrated in the previously mentioned MAA paper titled "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices", Maximum Protection/Availability modes are viable for a Data Guard deployment of approximately 345 miles, with minimal performance impact (no more than 3% in Oracle's internal tests) on the primary. A US coast-to-coast (i.e. WAN) deployment of Data Guard using the Maximum Performance mode has almost no performance impact (1% in tests) on the primary.

Network Bandwidth Issues During Standby Creation

If you are creating the standby database from a backup of your multi-terabyte production database, an issue that you have to resolve is how to ship the initial backup to the standby site. Sending this initial multi-terabyte backup to the standby site over the network may not be feasible. You may be better off by shipping the backup tape(s) to the standby site and subsequently using the network to copy incremental backups to the standby site.
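A quick estimate shows why shipping tapes can be the better choice. The Python sketch below computes how long a backup of a given size would take to cross a link; the 2 TB backup size is hypothetical, and reusing the 30% overhead figure from the bandwidth formula is an assumption made here for consistency, not a measured value.

```python
def transfer_time_hours(size_bytes, link_mbps, network_efficiency=0.7):
    """Hours needed to send size_bytes over a dedicated link of link_mbps."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0.0 < network_efficiency <= 1.0:
        raise ValueError("network_efficiency must be in (0, 1]")
    usable_bytes_per_sec = link_mbps * 1_000_000 * network_efficiency / 8
    return size_bytes / usable_bytes_per_sec / 3600


backup_bytes = 2 * 10**12  # hypothetical 2 TB backup
for name, mbps in (("T1", 1.544), ("T3", 44.736)):
    hours = transfer_time_hours(backup_bytes, mbps)
    print(f"{name}: about {hours:,.0f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, the transfer takes roughly six days on a fully dedicated T3 link and well over five months on a T1 link, before accounting for any redo traffic or other applications sharing the network.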

Data Guard provides an important optimization in this regard. While the backup tapes are in transit, the standby database may be mounted and started, based on the standby control file and initialization file sent to the standby site over the network. In such a situation, the standby database acts as an archive log repository. For example, any archive logs generated at the primary server since the backup of the primary database can be manually copied to the standby site over the network. Also, after redo shipping is enabled on the primary, any new redo data generated on the primary can automatically be sent to the standby server by Data Guard. This redo data is not applied to the standby database since it is not yet fully restored with the backups, but at least the archive logs will be available at the standby site. This way, Data Guard minimizes any risk of data loss in the event of a severe outage at the primary server while the backup tapes are in transit, and enables faster synchronization of the standby database with the primary since the required archive logs are already available locally at the standby site.

Once the backup tapes arrive at the standby site and the backups (full and incremental) are restored at the standby database, the standby database and the apply process can be started. All accumulated redo data at the standby site will now be automatically applied to the standby database. If necessary, Data Guard will use the network to automatically send any new primary database archive logs, or any missing archive logs, to the standby site and rapidly bring the standby up-to-date with the primary database.

What if I have a Slow Network?

If you have a slow network link between the production and DR data centers, seriously consider upgrading the network. Remember, Disaster Recovery is not an area where you would want to cut corners, especially if your business has strict availability requirements. In case there is a severe outage at the production site and your business operations are down, the last thing that you want to do at that critical moment is to figure out how much data you might have lost because redo data was not shipped to the standby because of a slow network, or figure out how much the standby database is behind the currently unavailable primary database.

Data Guard does provide you with some additional options in case you want to reduce the demands on your network resources for a highly active production database. If you have configured multiple standbys, consider the Cascaded Redo Log Destinations feature, with which you can have one standby database sending redo data to one or more standby databases, instead of requiring the primary database to send this redo to all standbys. This feature not only saves network resource consumption around the production data center, but also saves valuable processing cycles for the production database.
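The saving from cascading can be illustrated with simple arithmetic. The sketch below assumes that the primary sends a full copy of the redo stream to each destination it serves directly; the standby count is hypothetical, the 4.5 Mbps figure is the earlier worked example, and the function name is made up for this illustration.

```python
def primary_outbound_mbps(redo_mbps, direct_destinations):
    """Bandwidth leaving the primary when it feeds each destination itself."""
    if direct_destinations < 0:
        raise ValueError("direct_destinations must be non-negative")
    return redo_mbps * direct_destinations


redo_mbps = 4.5  # per-destination requirement from the earlier example
standbys = 3     # hypothetical number of standby databases

all_direct = primary_outbound_mbps(redo_mbps, standbys)
cascaded = primary_outbound_mbps(redo_mbps, 1)  # one standby forwards to the rest
print(f"Primary sends to all standbys: {all_direct} Mbps")
print(f"Primary sends to one cascading standby: {cascaded} Mbps")
```

The total redo traffic across the whole configuration does not shrink; cascading moves the forwarding load from the production site's outbound link to the cascading standby.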

Another option that you may evaluate is configuring the link with SSH port forwarding with compression. For a high latency low bandwidth network, SSH port forwarding is recommended for Maximum Performance mode. Oracle's internal tests in a high latency WAN showed that using SSH with compression made a significant reduction in network traffic and reduction in redo data transfer time. Refer to the "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices" paper for further details on the test results. Please also refer to the MetaLink Note 225633.1 "Implementing SSH port forwarding with 9i Data Guard" for configuration guidelines.

For additional guidelines related to tuning the relevant parameters for Data Guard, Oracle Net Services and your operating system, refer to the "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices" paper as well as the following MetaLink Notes:

MetaLink Note 241925.1: "Troubleshooting 9i Data Guard Network Issues"
MetaLink Note 260040.1: "Refining Remote Archival Over a Slow Network with the ARCH Process"

A question that we commonly get for this slow network issue is whether there is any way Data Guard can filter out selected transactions before sending the redo data to the standby sites. The answer is no. Every bit of redo data that is generated on the primary database will be sent over to the standby site - no filtering is possible. Make sure you understand the rationale here - Data Guard is a disaster recovery mechanism, so the general goal should be to keep the standby databases transactionally consistent with the primary, such that during a switchover or a failover, a chosen standby database may easily be transitioned into a primary role. If you need to transform or filter your redo data before sending it over to the standby site, consider an alternative solution such as Oracle Streams. Unlike Data Guard, Streams also allows the replication of a subset of the tables on the source database to the target database, and that could be another way to ensure that only the data that needs to be protected is transmitted across the network, especially when the available network bandwidth is not enough to keep up with the redo generation rate.

Note that after the redo data reaches the standby site, Data Guard SQL Apply (logical standby database) offers some flexibility in that it allows you to skip applying that redo for certain tables. This is not possible with Data Guard Redo Apply (physical standby database), which by definition is a block-for-block copy of the primary database.

A follow-up question is whether one can do NOLOGGING operations on the primary database in a Data Guard configuration, to reduce the load in the network. The answer is that one shouldn't do it. The redo data is the basis of Data Guard's operations. Since nologging operations write directly to the data files and bypass the redo logs, Data Guard will not be able to keep the standby database consistent with the primary during nologging operations. In fact, to ensure this doesn't happen, Oracle9i introduced the command ALTER DATABASE FORCE LOGGING; to make sure that all database write operations are logged. It is always a good idea to run this command on your production database so that you are protected against any application that may have NOLOGGING operations in-built in its code. Refer to the MetaLink Note 216211.1 "Nologging In The E-Business Suite" for further details in this matter.

Conclusion

This article focused on the network bandwidth implications of a Data Guard configuration. The objective of the article was to convey the most relevant issues in a concise manner and provide readers with helpful pointers for further reading.

Network bandwidth management is not a one-off exercise. It needs careful planning, review and understanding of SLAs for the supported applications, as well as SLAs promised by the network service provider, and continuous monitoring of the network to ensure that the business operations goals and availability requirements are being met. The good thing for administrators is that several bandwidth management and monitoring tools are available in the market, that allow administrators to extract the maximum value of their network connectivity investments, instead of buying extra bandwidth that ultimately may not be necessary.

Data Guard is an excellent choice for data protection and disaster recovery not just because of its comprehensive functionality, but also because of the way it is optimally architected to handle data transmission issues over a network. It is based on standard TCP/IP protocols, which means organizations can leverage existing resources, and not buy extra hardware, or incur extra training. The redo transmission is optimal - even though a write transaction affects the redo log files, archive log files and data files, Data Guard sends only the redo data to keep the standby databases synchronized with the primary. This is in contrast to certain storage-level remote mirroring solutions which may send all of those changes, requiring up to 3 times more network resource consumption. Data Guard also offers administrators the flexibility to configure their desired redo transport mechanism based on their business requirements. Finally Data Guard comes with a rich set of configuration guidelines and best practice blueprints that make it easy to implement and use.

References

1. Oracle Data Guard Overview
2. Oracle Data Guard Concepts and Administration Manual
3. Oracle9i Database Performance Tuning Guide and Reference Manual - Chap. 21: Using Statspack
4. MetaLink Note 94224.1 - "FAQ - Statspack Complete Reference"
5. Oracle Maximum Availability Architecture
6. Oracle9i Data Guard: Primary Site and Network Configuration Best Practices
7. MetaLink Note 225633.1 - "Implementing SSH port forwarding with 9i Data Guard"
8. MetaLink Note 241925.1 - "Troubleshooting 9i Data Guard Network Issues"
9. MetaLink Note 260040.1 - "Refining Remote Archival Over a Slow Network with the ARCH Process"
10. Oracle Streams Overview
11. MetaLink Note 216211.1 - "Nologging In The E-Business Suite"
