
Getting the right response

by BRIAN HUNT

Performance monitoring has, on occasions, been described as an art rather than a science. Its central aim is to optimize system performance in the same manner that the tuning of a car will increase its overall efficiency. Many methods have been developed to accomplish this, some with more success than others, but most have had little in the way of a true scientific basis. The demand for a more logical approach to performance management and capacity planning has therefore produced a new generation of software that is rapidly becoming established as the standard for the industry.

The criterion for any performance monitoring system is to ensure consistent and good response times for a data centre's customers. Accordingly, a data centre must establish goals and guidelines for the workloads that are to be processed. This means that workloads should be categorized according to the needs of end users so that during times of peak load the system can decide which to process first. Upon completion of this, an installation should be able to compare performance to objectives and decide if any adjustments are required.

Abstract: Performance monitoring and capacity planning are both vitally important to a data centre if it is to function well. Guidelines can be drawn up which outline the requirements from such software, but just as important is user reaction to identify problems in a system.

Keywords: data processing, software techniques, capacity management.

Brian Hunt is technical support manager for Candle Service Ltd.

User reaction

Historically, all phases of this process have proven difficult in many data centres. The first area of difficulty is how an installation decides the importance of one work element compared to another. This question is probably answered by asking end users about their business requirements.

Once a series of targets or service level objectives are implemented, the installation needs a method of deciding if its targets are being met. Not long ago, a performance monitor would be an additional piece of hardware often rivalling the CPU in price. Now, there are many relatively inexpensive software and hardware monitors available and the problem facing the data centre is which one to choose and why.

Rules of thumb

The traditional method of performance monitoring is based on 'rules of thumb' (ROTs). These are guidelines that have been observed on systems delivering good service and, in many cases, are backed by mathematical formulae. For example, on MVS/370, a disc should not be more than 30% busy; the CPU should be no more than 60% busy when running IMS/VS; and swap discs on MVS/370 should be kept below 60% busy.

Many of these ROTs date back to systems that only ran batch jobs and are sufficient in cases where a minor part of the job is dependent on processing time. The remainder of the time is spent setting up the job, printing the output and delivering the printout to the users. In these cases, users are probably expecting a total turnaround time measurable in hours, and a five minute delay in processing will be completely transparent.
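As a minimal sketch of what ROT-based checking amounts to, the thresholds below are the ones quoted above; the metric names and the sample reading are invented for illustration.

```python
# Rule-of-thumb (ROT) checking as described in the text. Thresholds are
# the ones quoted above; metric names and the sample are hypothetical.

ROT_THRESHOLDS = {
    "disc_busy_pct": 30,       # MVS/370: a disc should not exceed 30% busy
    "cpu_busy_pct": 60,        # CPU limit when running IMS/VS
    "swap_disc_busy_pct": 60,  # MVS/370 swap discs kept below 60% busy
}

def check_rots(sample: dict) -> list[str]:
    """Compare a utilization sample against the ROT thresholds and
    return a warning for every metric that exceeds its limit."""
    warnings = []
    for metric, limit in ROT_THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            warnings.append(f"{metric} at {value}% exceeds ROT limit of {limit}%")
    return warnings

# A hypothetical utilization sample
print(check_rots({"disc_busy_pct": 45, "cpu_busy_pct": 52}))
# -> ['disc_busy_pct at 45% exceeds ROT limit of 30%']
```

Note that the 45% busy disc triggers a warning regardless of whether it has anything to do with the workload that is actually slow, which is exactly the weakness discussed next.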

In recent years, the focus of data processing has changed, with many functions now being done by online systems. In these environments, the application of the rules of thumb would lead to the assumption that a disc which is 45% busy is causing poor performance of the online system. The next step is to analyse device activity, patterns of data reference on the disc, dataset blocksizes, etc. Only in the latter stages of this process is an examination carried out of which jobs are using the device, by using in-depth tools. It may transpire that the disc in question is not being used by the poorly performing workload. The process would then have to be repeated in an attempt to identify the device that might be causing the poor performance.

This process can be repeated many times, and it is likely to take so long that the problem will have changed before its cause is located. In fact, in some cases the work may have been delayed for reasons that do not show up using this type of search. This whole technique can therefore lead to installations having intermittent problems that never get properly diagnosed.

In this environment the purchase of new hardware may cure the problems, but it is most likely that the cure will come about as a result of changes that are made to move work to the new hardware, rather than through any focused approach to problems. This can lead to unnecessary purchase of hardware in the belief that buying a new CPU or a new bank of disc drives cured the problem last time and will cure it again. Conversely, it should be noted that many installations have installed new hardware to fix performance problems, and are disappointed by the results.


Alternative solution

The solution lies in a different approach to performance monitoring. The change required is for a data centre to stop looking at ROTs and to look at performance using the same measurement system as end users.

In this type of approach the data centre uses the service level objectives to decide if the user is getting his or her completed work back in the required time, rather than worrying if the hardware is working correctly. Why were ROTs used at all? The answer lies partly in the previous lack of good measurement tools at a reasonable price, but more importantly it is a result of an overlap with capacity planning.

Capacity planning is deciding how much hardware an installation needs in order to cater for planned growth while still being able to absorb unplanned growth if necessary.
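As an illustration of the arithmetic behind such projections, here is a minimal sketch assuming a simple compound-growth model in which utilization scales with workload volume; all figures are invented.

```python
# A simple compound-growth projection, assuming utilization scales
# linearly with workload volume. All figures are invented.

def project_utilization(current_pct: float, annual_growth: float, years: int) -> float:
    """Project utilization forward under compound workload growth."""
    return current_pct * (1 + annual_growth) ** years

# A CPU at 55% busy with 20% planned annual growth
for year in range(1, 4):
    print(f"Year {year}: {project_utilization(55.0, 0.20, year):.0f}% busy")
# Year 1: 66% busy
# Year 2: 79% busy
# Year 3: 95% busy
```

Such a projection is only as good as the baseline figure fed into it, which is the point made next.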

It requires that data centres know the capacity of their existing hardware and this is where rules of thumb come into play. It is impossible to 'capacity plan' without having a solid base to work from. Many installations have tried to combine performance monitoring and capacity planning into one operation using ROTs for a purpose to which they are not particularly suited. It is more important for an installation to get its performance monitoring systems working correctly. Otherwise all growth projections will be starting from a weak base that could inflate hardware requirements and result in unnecessary cost.

It is apparent that the best way to measure performance is from the users' viewpoint so that the data centre can be sure that it is achieving consistency. The performance monitoring software can then tell if the accounts batch jobs are on time, if online systems are being correctly served by the system, and if the transactions that run as part of online systems are each giving good response times. To achieve this, performance monitoring software is needed which allows service level objectives to be defined. In turn, these objectives can identify whether service levels are being met.
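A sketch of what such service level objective definitions might look like follows; the workload names, targets and measured figures are all invented.

```python
# Service level objectives measured from the user's viewpoint.
# Workload names, targets and measurements are invented.

service_level_objectives = {
    "accounts_batch": {"target": 120.0, "unit": "minutes turnaround"},
    "online_enquiry": {"target": 2.0,   "unit": "seconds response"},
}

def check_service_levels(measured: dict) -> None:
    """Report each workload as meeting or missing its objective."""
    for workload, slo in service_level_objectives.items():
        actual = measured.get(workload)
        if actual is None:
            continue
        status = "OK" if actual <= slo["target"] else "MISSED"
        print(f"{workload}: {actual} vs {slo['target']} {slo['unit']} -> {status}")

check_service_levels({"accounts_batch": 95.0, "online_enquiry": 3.4})
# accounts_batch: 95.0 vs 120.0 minutes turnaround -> OK
# online_enquiry: 3.4 vs 2.0 seconds response -> MISSED
```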

Making the correction

Having completed stage one of the performance monitoring cycle, a relationship needs to be established between poor response time and the resource that is the cause of the problem. To perform this second stage it is necessary to measure what the delayed workload is doing so that its elapsed time can be attributed to each of the resources that it uses. This will result in a series of times or percentages, with the highest figure relating to the resource that needs to be analysed. This may not be a device that would have shown up using the rule of thumb based system. It may even be a device that is not typically busy. It is, however, the device that the workload spends most of its time on, and therefore the one that needs to be analysed if the time required is to be reduced.
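A minimal sketch of this second stage, attributing a delayed workload's elapsed time across the resources it used and ranking the contributors, is shown below; the resource names and timings are invented.

```python
# Attribute a delayed workload's elapsed time to the resources it used,
# as described above. Resource names and timings are invented.

def attribute_delay(time_by_resource: dict[str, float]) -> list[tuple[str, float]]:
    """Convert per-resource wait/use times into percentages of total
    elapsed time, sorted so the biggest contributor comes first."""
    total = sum(time_by_resource.values())
    shares = [(res, 100.0 * t / total) for res, t in time_by_resource.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)

# Seconds of elapsed time one slow transaction spent on each resource
breakdown = attribute_delay({
    "cpu": 0.3,
    "disc_A": 0.5,   # not especially busy overall, yet the workload waits here most
    "disc_B": 0.2,
    "paging": 0.1,
})
for resource, pct in breakdown:
    print(f"{resource}: {pct:.0f}% of elapsed time")
# disc_A: 45% of elapsed time, cpu: 27%, disc_B: 18%, paging: 9%
```

The point of ranking by the workload's own time, rather than by device busyness, is that disc_A tops the list even if a ROT-style busy check would never have flagged it.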

The third stage in a performance monitor is the ability to analyse the resource in depth to see if the problem is caused by the poorly performing workload, or an interaction of several workload elements.

Once the problem has been identi- fied and solved, the cycle returns to the first stage, with the monitoring of all workloads against their service level objectives. This cyclic process can continue forever, but on most iterations stage one will show that all workloads are performing according to their objectives and that no further action is required.

As defined so far, this system is only triggered when service level objectives are not being met. There are events that can happen that are known to result in poor performance or system failure. Examples of these are CICS going 'short on storage', IMS entering 'selective despatching', MVS being close to total CSA exhaustion and VM suffering high rates of 'free storage expansion'. If the monitoring system warned of the approach of these situations, it would be possible to predict poor response time before it occurred, and take corrective action.
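A sketch of such early warning follows. The condition names come from the examples above, but the metric names, thresholds and readings are invented; a real monitor would take them from the relevant subsystem.

```python
# Warn as a system approaches known trouble conditions, so corrective
# action can be taken before response times suffer. Condition names come
# from the text; metric names, thresholds and readings are invented.

WARNING_THRESHOLDS = {
    "cics_free_storage_pct": 10,  # warn as CICS nears 'short on storage'
    "mvs_csa_free_pct": 5,        # warn as MVS nears total CSA exhaustion
    "vm_free_storage_expansions_per_min": 50,  # high 'free storage expansion' rate
}

def approaching_trouble(readings: dict) -> list[str]:
    """Return a warning for each reading that has crossed its threshold."""
    alerts = []
    for metric, limit in WARNING_THRESHOLDS.items():
        value = readings.get(metric)
        if value is None:
            continue
        # For the 'free storage' metrics low is bad; for the rate metric high is bad
        bad = value <= limit if metric.endswith("_pct") else value >= limit
        if bad:
            alerts.append(f"WARNING: {metric} = {value} (threshold {limit})")
    return alerts

print(approaching_trouble({"cics_free_storage_pct": 8,
                           "vm_free_storage_expansions_per_min": 12}))
# -> ['WARNING: cics_free_storage_pct = 8 (threshold 10)']
```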

Summary

All problems that an installation will suffer are the result of change. These changes may be workload growth, workload mix, software or hardware changes. If an historical system is added to the realtime requirement as defined so far, it would be possible to see the profiles of previous workloads, if changes are occurring there, and if the hardware or software has changed in a manner that may cause performance problems. By adding this historical facility it would also be possible to verify whether a CPU upgrade was the cause of performance improvement, or if improvement was a result of other unplanned changes.

A checklist of requirements can be drawn up for a performance monitoring system to:

• check for abnormal conditions
• check actual performance against service levels
• identify why poorly performing workloads are delayed
• analyse all possible delay reasons/resources
• check workload size
• check workload mix
• check software definition actually in use
• check changes to software definition
• check hardware actually in use
• check changes to hardware definition

Software that allows these ten requirements to be met means that the data centre will be in a position to control its workload rather than have the workload control it.

Candle Service Ltd., 13-15 John Adam St, London WC2N 6LU, UK.
