
Page 1: Progress Openedge performance management

Progress OE performance management

Yassine MOALLA

Progress OE application support

February 2017

Page 2: Progress Openedge performance management

Introduction

• Imagine you are the owner of a restaurant.

• You employ two chefs, one broker, and five servers.

• It is lunch time, and many clients are arriving.

• The broker assigns servers to clients.

• The servers welcome the clients, seat them, and take their orders.

• Clients are waiting… then waiting… then waiting… until they get angry and start leaving your restaurant.

• What is happening? Why are meals not prepared quickly? What are the chefs doing?

• You go to the kitchen and check.

Page 3: Progress Openedge performance management

Definition

• Application Performance Management (APM) is measuring application performance and tuning resources used by the application in order to maintain an expected level of service.

Page 4: Progress Openedge performance management

Performance management

QAD application performance depends on many factors:

• CPU: manipulates data and executes programs

• Memory: stores data so it can be accessed quickly during operations

• Disk controllers: read and write data to disk

• Operating system mechanisms: allocate and use system resources

• Network: exchanges data between clients and servers

• ….. and many other resources

Let us agree on some indicators of performance status.

Page 5: Progress Openedge performance management

CPU

• Let us imagine that the CPU is the chef.

• The chef manipulates ingredients (data) and prepares a plate according to a customer's order (a client's query).

Page 6: Progress Openedge performance management

• If clients are complaining and your chef is 100% busy, your chef is overloaded: you need to recruit another chef to help him.

CPU

Page 7: Progress Openedge performance management

• If clients are complaining and your chef is taking a rest (not busy), your chef might be waiting for something: identify the bottleneck* and eliminate it.

• *Performance bottlenecks occur when a resource performs inadequately and prevents other resources from accomplishing their work.

CPU

Page 8: Progress Openedge performance management

CPU threshold

• CPUs are overloaded when the run queue size is greater than the number of CPUs on the machine.

• CPUs are not busy when they are waiting or idle. Example of a high I/O read load (vmstat output):

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache  si  so    bi   bo   in   cs us sy id wa st
 3  1 465872  36132  82588 1018364   7  17    70  127  214  838 12  3 82  3  0
 0  1 465872  33796  82620 1021820   0   0 34592    0  357  781  6 10  0 84  0
 0  1 465872  36100  82656 1019660   0   0 34340    0  358  723  5  9  0 86  0
 0  1 465872  35744  82688 1020416   0   0 33312    0  345  892  8 11  0 81  0
 0  1 465872  35716  82572 1020948   0   0 34592    0  358  738  7  8  0 85  0
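The run-queue rule above can be sketched as a small shell check. This is an illustration, not from the slides: it assumes Linux, where the fourth field of /proc/loadavg is the "runnable/total" task count.

```shell
# Compare the number of runnable tasks with the number of CPUs (Linux only).
cpus=$(nproc)
runq=$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f1)
if [ "$runq" -gt "$cpus" ]; then
    status="overloaded"
else
    status="ok"
fi
echo "run queue=$runq cpus=$cpus -> $status"
```

A sustained run queue larger than the CPU count suggests adding CPU capacity; a short spike is normal.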

Page 9: Progress Openedge performance management

Disks controllers

• Disk I/O is a common bottleneck because reading and writing data to disk is a relatively slow operation.

• To come back to our example, imagine that every time the chef (CPU) needs an ingredient (data), he has to ask his assistant (database server) to go to the storage room (database) and bring the ingredient (data) back. You will agree that this way the chef spends a lot of time waiting for ingredients (data). The solution: minimize the time needed to retrieve the ingredients (data). But how?

Page 10: Progress Openedge performance management

• Storing the most useful ingredients (data) on a table (buffer pool) placed right next to the chef (CPU) is one way to minimize disk I/O time.

• Every time the chef uses ingredients (data) from buffer pool, we call that: Buffer hit.

• If the buffer hit ratio is around 99%, the chef finds the ingredients he needs 99% of the time: the -B database startup parameter is well tuned.

Memory/Buffers

Page 11: Progress Openedge performance management

Buffer hits

• Let us define:

• A data read from the buffer pool → logical read

• A data written to the buffer pool → logical write

• A data read from disk → OS read

• A data written to disk → OS write

• Buffer hits = ((logical reads + logical writes) − (OS reads + OS writes)) / (logical reads + logical writes)

• Buffer hits should be 99% or more. Why?

• 95% = 5 OS reads/writes out of 100 went to disk

• 99% = 1 OS read/write out of 100 went to disk

• Conclusion: 99% means five times fewer disk operations than 95%, so it is 500% better, not merely 4 percentage points better.
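The formula can be checked with shell arithmetic. The counter values below are hypothetical, standing in for numbers you might read from PROMON:

```shell
# Hypothetical sample counters (e.g. from PROMON buffer cache activity).
logical_reads=100000
logical_writes=20000
os_reads=1100
os_writes=100

total=$((logical_reads + logical_writes))
misses=$((os_reads + os_writes))
# Integer math, scaled by 100 to get a percentage.
hit_pct=$(( (total - misses) * 100 / total ))
echo "Buffer hit ratio: ${hit_pct}%"   # prints "Buffer hit ratio: 99%"
```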

Page 12: Progress Openedge performance management

• If the buffer hit ratio is below 99% most of the time, the table space (buffer pool) is not big enough to hold all the useful ingredients (data). The buffer pool must be enlarged: increase the -B database startup parameter.
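As a hedged illustration (the database path and value are hypothetical, not recommendations), -B is passed when the broker is started:

```shell
# Start the broker with a 500,000-block buffer pool.
# Size -B against available RAM: too large a value causes swapping.
proserve /db/production -B 500000
```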

Tuning Buffers

Page 13: Progress Openedge performance management

• The PRO of using buffers: improved performance, by decreasing disk I/O operations.

• The CON of using too many buffers: the system may start swapping, which itself causes disk I/O operations.

Be careful!!

Page 14: Progress Openedge performance management

• If the chef doesn't find an ingredient (data) he needs on the table (buffer pool), he asks his assistant (database server) to bring it. The assistant takes the least recently used (LRU) ingredient from the table back to the storage room (database) and replaces it with the required ingredient (data). This operation is known as eviction.

• Eviction requires two steps: writing the evicted data to disk, then reading the new data from disk into the buffer.

Least Recently Used list (LRU)

Page 15: Progress Openedge performance management

Asynchronous Page Writer (APW)

• Remember: eviction requires two steps: writing the evicted data to disk, then reading the new data from disk into the buffer.

• What if you recruit a new agent (let us call it the APW) that, in the background, writes dirty buffers* to disk and then marks them as available? This way, the database server only has to read data from the database into the buffer. So, only one operation!

• *Dirty buffer: a buffer that has been changed in memory but not yet written to disk.
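In OpenEdge, the APW runs as a separate background process started against a running database; a hedged example (the database path is hypothetical):

```shell
# Start one Asynchronous Page Writer; run the command once per APW wanted.
proapw /db/production
```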

Page 16: Progress Openedge performance management

Checkpoints

• A checkpoint is a process by which the in-memory and on-disk states of the database are reconciled.

• As transactions execute, changes are made to buffers in memory, and the version on disk becomes progressively more obsolete. During a checkpoint, all database changes are written to disk, making the volatile copy (data in memory) and the stable copy (data on disk) consistent.

• During a checkpoint, NO transaction activity can take place.

• Asynchronous Page Writers (APWs) minimize this overhead by periodically writing modified buffers to disk, so that when a checkpoint occurs, fewer buffers need to be written.

Page 17: Progress Openedge performance management

Checkpoint frequency/duration

• To view the checkpoint duration value:

• Access PROMON and enter R&D at the main menu.

• Choose option 3, Other Displays, then choose option 4, Checkpoints.

• A duration less than 1 second is a good goal.

Page 18: Progress Openedge performance management

Shared resources/Latches

• Let us come back to the restaurant example.

• In the restaurant there is only one toilet for all clients.

• The client who gets in first turns the latch to lock the door.

• When he gets out, he releases the latch, so the door is open and the toilet is available again.

• The LRU list is a shared resource with an exclusive lock. Only one APW can update it at a time.

• What happens if several APWs need to update the LRU list at the same time?

Page 19: Progress Openedge performance management

Spin/Nap

• When an APW needs to update the LRU (Least Recently Used) list, it attempts to acquire the LRU latch. If it cannot acquire the latch, it repeats the attempt. This iterative process is called spinning.

• If an APW fails to acquire the latch after a specified number of spins, it pauses, or takes a nap, before trying again.

• If an APW repeatedly fails to acquire the latch, the length of its nap is gradually increased.

• You can set the Spin Lock Retries (-spin) parameter to specify how many times to test a lock before napping.
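The spin-then-nap idea can be sketched conceptually in shell. This illustrates the pattern only, not Progress internals: a directory stands in for the latch (mkdir is atomic), and all names and values are illustrative.

```shell
latch=/tmp/lru.latch.$$   # unique path per process; stands in for the LRU latch
spin_retries=1000
nap_ms=10
tries=0
until mkdir "$latch" 2>/dev/null; do        # "spin": retry the acquisition
    tries=$((tries + 1))
    if [ "$tries" -ge "$spin_retries" ]; then
        sleep "$(awk "BEGIN{print $nap_ms / 1000}")"   # "nap" before retrying
        nap_ms=$((nap_ms * 2))              # naps lengthen on repeated failure
        tries=0
    fi
done
echo "latch acquired"
# ... update the LRU list here ...
rmdir "$latch"                              # release the latch
```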

Page 20: Progress Openedge performance management

Tuning -spin

• To view the resource waits value:

• Access PROMON and enter R&D at the main menu.

• Choose option 3, Other Displays, then choose option 1, Performance Indicators. Resource waits is the last item reported in the listing. Less than 10 per second is OK.

• A high “Latch timeouts” value could indicate that -spin is too low.
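A hedged example of raising the spin count at broker startup (the path and value are illustrative; measure resource waits before and after any change):

```shell
# Raise the number of spin retries before a process naps.
proserve /db/production -spin 5000
```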

Page 21: Progress Openedge performance management

Be careful!!

• By setting the -spin value higher, you can reduce resource waits.

• Continuing to raise it, however, can adversely affect CPU utilization.

Page 22: Progress Openedge performance management

Latch counts

• LRU, MRU, PWQ and CPQ are not the only latches used by Progress.

• To see all the latches:

• Access PROMON, enter R&D at the main menu, then enter the DEBUG menu.

• Choose option 6 (hidden menu), then choose option 11, Latch counts.

Page 23: Progress Openedge performance management

OM Latch

The Storage Object Cache holds copies of information about the most frequently accessed objects so that they can be found quickly, with minimal overhead.

Page 24: Progress Openedge performance management

Tuning OM Latch

• OM (Object Cache) latch activity can be eliminated entirely by setting the -omsize parameter equal to or greater than the number of _StorageObject records.
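A hedged example of setting the object cache size at broker startup (the path and value are hypothetical; the value should be at least the count of _StorageObject records in your database):

```shell
# Size the storage object cache so every _StorageObject record fits.
proserve /db/production -omsize 2048
```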

Page 25: Progress Openedge performance management

Tuning APW

• The PROMON R&D option Page Writers Activity display shows statistics about APWs running on your system.

• Nonzero numbers in the Flushed at Checkpoint row indicate that the APWs were unable to write buffers fast enough to prevent a memory flush.

• Increase the number of APWs and/or increase the cluster size to eliminate the flush.

Page 26: Progress Openedge performance management

Be careful!!

• The PRO of using APWs: improved performance, by:

• Ensuring that a supply of empty buffers is available, so the database engine doesn't have to wait for database buffers to be written to disk.

• Reducing the number of dirty buffers in memory, so that when a checkpoint occurs, fewer buffers must be written.

• The CON of using too many APWs: many conflicts on the LRU latch.

Page 27: Progress Openedge performance management

Database block size

A second way to decrease disk I/O operations is to use a larger database block size.

• Disk I/O transfers are made by block, so a larger block size provides greater I/O efficiency.

• On Linux, an 8 KB block size may be used with kernel versions 2.6 and newer.
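The block size is fixed when the database structure is created, so changing it implies a dump and reload into a new structure; a hedged sketch (database and structure file names are hypothetical):

```shell
# Create a new database structure with 8 KB blocks (-blocksize is in bytes).
prostrct create /db/production production.st -blocksize 8192
```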


Page 28: Progress Openedge performance management

Before Image

• Before-imaging is always enabled so that PROGRESS can recover transactions if the system fails. This mechanism is essential for database reliability, but it creates a significant amount of I/O that can affect performance.

• PROGRESS must always record a change in the BI file before it can record the change in the database and after-image (AI) files. If BI activity creates an I/O bottleneck, all other database activities are affected.

Page 29: Progress Openedge performance management

Monitoring Before Image activity

• Use the PROMON utility to monitor specific BI activity. Use the R&D option BI Log Activity. The following figure shows a sample display.

Page 30: Progress Openedge performance management

Before Image Writer

• The BIW is a background process that continually writes filled BI buffers to disk.

• BIW ensures that a supply of empty buffers is available to client and server processes.

• When no BIW is started, the database manager assumes the responsibility of flushing BI buffers from shared memory to disk.
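Like the APW, the BIW is started as a background process against a running database; a hedged example (the path is hypothetical):

```shell
# Start the Before Image Writer (at most one per database).
probiw /db/production
```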

Page 31: Progress Openedge performance management

Tuning BI buffers

• Increasing the number of BI buffers increases the availability of empty BI buffers to client and server processes.

• Increase BI buffers if there are any empty buffer waits in the PROMON Activity screen

• Increase BI buffers if “Writes by BIW” is not high enough. “Writes by BIW” is reported as a percentage of the total number of BI blocks written to disk.

Page 32: Progress Openedge performance management

BI block size

• PROGRESS reads and writes information to the BI file in blocks. Increasing the size of these blocks allows PROGRESS to read and write more data at one time. This can reduce I/O rates on disks where the BI files are located.
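A hedged example of changing the BI block size with proutil (the path and value are illustrative; the database must be offline and the BI file is truncated by the operation):

```shell
# Set a 16 KB BI block size (-biblocksize is in KB).
proutil /db/production -C truncate bi -biblocksize 16
```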

Page 33: Progress Openedge performance management

BI cluster size

• The BI file is organized into clusters on disk. As PROGRESS writes data to the BI file, these clusters fill up. When a cluster fills, PROGRESS must ensure that all modified database buffer blocks referenced by notes in that cluster are written to disk. This is known as a checkpoint.

• Raising the BI cluster size increases the interval between checkpoints.

• Raising the BI cluster size can reduce the I/O overhead of writing modified database buffers to disk.

• It also lets you defer writes and collect more changes before writing a block; this lets you write multiple changes with the same write.
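The cluster size is changed with the same offline proutil operation; a hedged example (the path and value are illustrative, to be tuned against observed checkpoint frequency):

```shell
# Set a 1024 KB BI cluster size (-bi is in KB).
proutil /db/production -C truncate bi -bi 1024
```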

Page 34: Progress Openedge performance management

BI cluster size

• Larger cluster sizes generally increase performance. They also have drawbacks:

• Increased disk space usage for the BI file.

• Longer crash recovery periods.

• Longer checkpoint times.

Page 35: Progress Openedge performance management

Index

Using bad indexes can lead to bad performance.

• If the CPU is waiting while the database server is busy searching for the required data, clients wait too long: bad performance.

• The PRO of having appropriate indexes: the disk controller can pick the right data with minimal disk I/O operations.

• The CON of having too many indexes: every update of a record requires updates to all the indexes related to the fields that were changed. Therefore, one record update might mean multiple I/O operations.

Page 36: Progress Openedge performance management

Locked records

• Long transactions can lead to too many record locks.

• The database server will, in this case, wait to access the locked records.

• Check “Rec Lock Waits” in PROMON and try to keep this percentage as low as possible by modifying the application to perform shorter transactions.

Page 37: Progress Openedge performance management

Tuning Locked records

• To find who is performing the most locks:

• Access PROMON and enter R&D at the main menu.

• Choose option 3, Other Displays, then choose option 3, Lock Requests By User. Identify the user number with the most locks.

Page 38: Progress Openedge performance management

Appserver agents

• The number of AppServer agents should be set correctly to avoid poor performance.

• Minimum servers: Progress recommends one AppServer agent per CPU on the machine.

• You can expect each AppServer agent to handle requests from up to 20 AppServer clients (or more).