Mainframe Global and Workload Levels Statistical Exception Detection System, Based on MASF

Igor Trubin, Ph.D. and Linwood Merritt

Capital One Services, Inc.

September 2004


Introduction: Environment

• Capital One: 6th largest card issuer in the United States

– Added to the S&P 500 in 1998

– Fortune 500 company starting in 2000

– Managed loans at $71.8 billion

– Accounts at 46.7 million

– CIO 100 Award “Master of the Customer Connection”

– Information Week “Innovation 100” Award Winner

– ComputerWorld “Top 100 places to work in IT”


Statistical Analysis of Mainframe Performance Data

• SEDS is the Statistical Exception Detection System, based on the Multivariate Adaptive Statistical Filtering (MASF) technique.

• SEDS automatically scans large volumes of performance data and identifies measurements that differ significantly from their expected values.

• MASF is an extension of Statistical Process Control (also known as Quality Control), which was developed by Walter Shewhart of Bell Telephone Laboratories in the 1920s.

• The MASF procedure was designed by BGS Systems, Inc. and presented at CMG in 1995.

• SEDS was developed by this author and presented at CMG 2002, where the paper was named the best paper.


Review of the Existing Tools

– SAS/QC (Quality Control)

– JMP from SAS

– BEZsystems for Oracle and Teradata

– Concord eHealth – DFN (Deviation From Normal)

– The Patrol Perform and Predict tool from BMC Software

The common output is control charts for monitoring variations in a process under statistical control.


SEDS Structure

– Exception detectors for the most important metrics;

– SEDS database with the history of exceptions;

– Statistical process control daily profile chart generator;

– Exception server name list generator;

– Leader/Outsider servers/workload detector and detector of defective (runaway) processes; and

– Leaders/Outsiders bar charts generator.

Figure: SEDS structure diagram. Components shown:

– Multiplatform environment and data collection

– Performance Data Base (SAS/ITRM) for Unix, NT, Tandem, Unisys, and MVS servers

– Global Exception Detectors (SAS program): CPU Util./MIPS, CPU Queue, Disk I/O rate, Memory Page Rate, Memory Util.

– Decision: Exception? (No / Yes)

– Appl. Exception Detectors (SAS program): CPU Util./MIPS, # of active processes, Disk I/O rate

– History of exceptions (EDS Database, SAS dataset)

– SPC daily profile charts (see slide 6 as example)

– Exception server and Appl. name lists

– Runaway process and server/appl. leaders detectors (SAS program)

– Leaders/Outsiders bar charts (see slide 18 as example)

– E-mail notification, Web publishing, and ad-hoc analyses


CPU Utilization Control Chart for Web Report:

The full "7 days X 24 hours" adaptive filtering policy is applied: the average, upper, and lower statistical limits of a particular metric are calculated for each hour of each weekday, using the past six months of data.
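The "7 days X 24 hours" policy can be illustrated with a minimal Python sketch. It is not the authors' SAS implementation: the function names `masf_limits` and `find_exceptions`, the multiplier `k=3.0` (the slides do not state how many standard deviations define the limits), and the synthetic data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def masf_limits(history, k=3.0):
    """Group (timestamp, value) samples by (weekday, hour) and return
    {(weekday, hour): (lower, average, upper)} with limits at average +/- k*sigma.
    The multiplier k is an assumption, not a value taken from the slides."""
    groups = defaultdict(list)
    for ts, value in history:
        groups[(ts.weekday(), ts.hour)].append(value)
    limits = {}
    for slot, values in groups.items():
        avg = mean(values)
        sigma = pstdev(values)
        limits[slot] = (avg - k * sigma, avg, avg + k * sigma)
    return limits


def find_exceptions(day, limits):
    """Return (timestamp, value, kind) for samples outside their slot's limits."""
    found = []
    for ts, value in day:
        lower, _, upper = limits[(ts.weekday(), ts.hour)]
        if value > upper:
            found.append((ts, value, "upper"))
        elif value < lower:
            found.append((ts, value, "lower"))
    return found


# Synthetic history: 26 weeks (about six months) of hourly CPU utilization.
start = datetime(2004, 3, 1)
history = []
for week in range(26):
    swing = 2.0 if week % 2 else -2.0  # alternate weeks sit 2 points above/below
    for hour_offset in range(7 * 24):
        ts = start + timedelta(weeks=week, hours=hour_offset)
        history.append((ts, 30.0 + ts.hour + swing))

limits = masf_limits(history)

# The day after the history ends: normal except for two injected outliers.
day_start = start + timedelta(weeks=26)
day = []
for hour in range(24):
    value = 30.0 + hour
    if hour == 3:
        value = 20.0  # below the lower limit for that hour
    if hour == 10:
        value = 50.0  # above the upper limit for that hour
    day.append((day_start + timedelta(hours=hour), value))

exceptions = find_exceptions(day, limits)
for ts, value, kind in exceptions:
    print(ts.isoformat(), value, kind)
```

Because each of the 168 weekday-hour slots gets its own limits, a value that is normal on a Monday morning can still be flagged on a Sunday night.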


SEDS against Unisys and Tandem Platforms Performance Data

Variable  Data Type  Description
CPUTOT    PERCENT1   CPU Utilization
DATETIME  DATETIME   Datetime
DURATION  TIME       Duration of interval
HOUR      GAUGE      Hour the event occurred
LSTPDATE  DATETIME   Last process date
MACHINE   STRING     Machine Name
MAVAILK   FLOAT      Memory Available Kword
MEMNUSE   PERCENT1   Memory in Use
MONTH     FORMULA
MOVRLY    PERCENT1   Memory Overlayable
READYQ    FLOAT      CPU Ready Queue

Variable  Data Type  Description
CPUNUM    STRING     CPU Number
CPUQUE    INT        Run-queue
CPUTOT    INT        CPU Utilization
DATETIME  DATETIME   Datetime of sample/event
DISK      STRING     Disk Name
DISKIO    RATE       Disk I/Os Per Second
DUTIL     PERCENT1   Disk Utilization
EXTENT    INT        Maximum Free Extent (Megabytes)
HOUR      GAUGE      Hour Summary Variable
MACHINE   STRING     Machine Name
MEMUTIL   INT        Memory Utilization
MONTH     FORMULA
SPFREE    INT        Free Space (Megabytes)
SPUSED    INT        Space Used (Megabytes)
SWAPS     RATE       Memory Page (4K) Swaps per Second

SEDS works with hourly or daily performance data.

The schemas of the “day” tables in ITRM for Unisys and Tandem platforms are shown here.

Good candidates for use in SEDS are marked in red.


Examples of Captured Exceptions for Unisys and Tandem

The Unisys server had unusually low utilization, which might indicate disk or database performance problems.

The Tandem server, in contrast, had two unusual spikes in CPU utilization that crossed the upper limit.


Global Performance Data for MVS Platform


Variable  Data Type  Description
CLUSTER   STRING     Cluster Name (Simplex name)
CPUMIPS   INT        CPU Cycles Used (MIPS)
CPUMIPX   INT        CPU Processor Maximum MIPS
CPUTOMX   FORMULA    CPU Utilization for Interval Max
CPUTOT    FORMULA    CPU Utilization for Interval Average
DISKIO    FLOAT      Disk I/O (EXCPs)
HOUR      INT        Hour Summary Variable
LPAR      STRING     Logical Partition
MACHINE   STRING     Server Name (Footprint)
MEMNUSE   PERCENT1   Total Memory Utilization
MONTH     FORMULA
READYQ    FLOAT      CPU Queue length
SHIFT     STRING     Shift of work week

A set of nightly batch jobs:

– dumps the remaining active accounting data,

– consolidates the data,

– processes the data in SAS, and

– updates the ITRM PDB.

The schemas of the “day” tables in ITRM for MVS platforms are shown in the table.

Good candidates for use in SEDS are marked in red.


Examples of Captured Exceptions for One of the Logical Partitions (LPAR)

This chart shows the utilization of a single LPAR within a shared system, not of the entire system, so 100% is not a true threshold.

SEDS instead provides a statistical threshold, which is more accurate and dynamic.


BMC Visualizer MASF vs. SEDS

You can use BMC Visualizer to find any other exceptions based on other filtering policies. For that, the BMC collector needs to be installed on the server and BMC Visualizer must be used manually to capture any MASF exceptions.

BMC Visualizer example: the System Hierarchy (spectrum) and Control charts

SEDS is preferable as the automated MASF chart generator.

In addition, SEDS can automatically notify a performance analyst when a statistical exception occurs.


Application Level SEDS for MVS Platform

One problem is that, based on LPAR level data alone, it is impossible to figure out which particular workloads are responsible for an exception.

However, the data collection process provides application level data across all LPARs.

Looking at a stacked workload data chart, it is difficult to find the application that is responsible for spikes in overall CPU usage.

SEDS shows that Appl1 was responsible for the global maxima in the overall MIPS chart.
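The idea of finding the responsible workload with per-application limits can be sketched in Python as follows. The MIPS figures, the multiplier `k=3.0`, and the helper `upper_limit` are hypothetical; only the name Appl1 comes from the slide, and this is not the authors' SAS program.

```python
from statistics import mean, pstdev


def upper_limit(values, k=3.0):
    """Upper statistical limit: average + k * sigma (k is an assumption)."""
    return mean(values) + k * pstdev(values)


# Hypothetical MIPS history for one weekday-hour slot, per application.
history = {
    "Appl1": [100.0, 104.0, 96.0, 100.0],
    "Appl2": [300.0, 310.0, 290.0, 300.0],
    "Appl3": [50.0, 52.0, 48.0, 50.0],
}

# Hypothetical latest measurement for the same slot.
latest = {"Appl1": 160.0, "Appl2": 305.0, "Appl3": 51.0}

# Applications above their own upper limit, largest excess first.
culprits = sorted(
    (app for app in latest if latest[app] > upper_limit(history[app])),
    key=lambda app: latest[app] - upper_limit(history[app]),
    reverse=True,
)
print(culprits)
```

In this made-up data Appl2 is the largest consumer and would dominate a stacked chart, yet only Appl1 departs from its own history, which is what a workload-level control chart exposes.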


Other Reasons to Generate a Workload Control Chart

1. To capture unusual behavior of a relatively small application that was not big enough to create a global exception:

2. To prove the stable behavior of any essential or critical application:


Service Class/Period Type of Metrics under SEDS

- Hourly SUM of the average response per transaction: RESP (it shows values consistently larger than the average)

- Hourly SUM of ended transaction count: TRANS

- Hourly SUM of elapsed tasks duration: CPUsec (not always reported correctly for long-running servers)

ElapsedSec = (number of tasks) * 3600 seconds.


Performance Status Automatic Recognition, WEB Report and E-mail Notification

A green color in the WEB table indicates no exceptions.

A magenta color indicates that the exceptions only exceeded the lower limit.

A yellow color means an exception occurred on a particular server or LPAR.

(NUP - NLOW) is the severity or type of the exceptions, shown under the link to a MASF chart, where NUP is the number of upper limit exceptions and NLOW is the number of lower limit exceptions during the previous day.

The table also shows the number of applications or Service Classes with exceptions.
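The status rules above can be sketched in Python. The functions `status_color` and `severity_label` are hypothetical names. Treating yellow as "at least one upper limit exception" is an interpretation of the slide, and the label format simply shows the two counts side by side, since the slide does not say whether (NUP - NLOW) is a pair or a difference.

```python
def status_color(nup, nlow):
    """Map the previous day's exception counts to a report color.

    Assumed reading of the slide: green = no exceptions,
    magenta = lower limit exceptions only, yellow = any upper limit exception.
    """
    if nup == 0 and nlow == 0:
        return "green"
    if nup == 0:
        return "magenta"
    return "yellow"


def severity_label(nup, nlow):
    """Text shown under the link to the MASF chart, as a NUP/NLOW pair."""
    return f"({nup} - {nlow})"


# Hypothetical servers with (NUP, NLOW) counts for the previous day.
counts = {"LPAR1": (0, 0), "LPAR2": (0, 2), "LPAR3": (3, 1)}
report = {
    name: (status_color(nup, nlow), severity_label(nup, nlow))
    for name, (nup, nlow) in counts.items()
}
for name, (color, label) in report.items():
    print(name, color, label)
```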


Links to the Workload Control Charts


Exception Database and “Extra Volume” Metric

The SEDS database keeps a history of exceptions and has the following structure:

NAME - server name;
DAYMEAN - metric daily average;
_FREQ_ - number of weekdays in the server's data history (must be >6);
NUP - number of upper limit exceptions;
NLOW - number of lower limit exceptions;
DATE - exception appearance date;
PLATFORM - server configuration;
METRIC - performance metric name;
ExtraVolume - …..

ExtraVolume is the numeric estimation of the exception magnitude. For CPU utilization it is an ExtraTime.

It is calculated as the area between the limit curve and the actual data curve (for periods when the exceptions occurred).

For CPU metrics, the physical meaning is the CPU time (or MIPS) the server has taken beyond the standard-deviation-based limit.
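A minimal Python sketch of one exception record and of the ExtraVolume calculation is given here. The `ExceptionRecord` class mirrors the field names listed on the slide. The rectangle approximation of the area, the negative sign for lower limit exceptions, and every sample value are assumptions made for illustration, not the authors' SAS code.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class ExceptionRecord:
    name: str            # NAME: server name
    daymean: float       # DAYMEAN: metric daily average
    freq: int            # _FREQ_: weekdays in the data history (must be > 6)
    nup: int             # NUP: number of upper limit exceptions
    nlow: int            # NLOW: number of lower limit exceptions
    day: date            # DATE: exception appearance date
    platform: str        # PLATFORM: server configuration
    metric: str          # METRIC: performance metric name
    extra_volume: float  # ExtraVolume: exception magnitude


def extra_volume(actual, lower, upper, interval_hours=1.0):
    """Area between the data curve and the violated limit curve.

    Each interval is treated as a rectangle (an assumption). Area above the
    upper limit counts as positive, area below the lower limit as negative.
    """
    total = 0.0
    for value, lo, up in zip(actual, lower, upper):
        if value > up:
            total += (value - up) * interval_hours
        elif value < lo:
            total -= (lo - value) * interval_hours
    return total


# Hypothetical hourly CPU utilization (%) with flat limits for simplicity.
actual = [50.0, 90.0, 95.0, 40.0, 10.0]
upper = [80.0] * 5
lower = [20.0] * 5

extra = extra_volume(actual, lower, upper)
record = ExceptionRecord(
    name="LPAR1",
    daymean=mean(actual),
    freq=26,
    nup=sum(a > u for a, u in zip(actual, upper)),
    nlow=sum(a < l for a, l in zip(actual, lower)),
    day=date(2004, 9, 1),
    platform="MVS",
    metric="CPUTOT",
    extra_volume=extra,
)
print(record)
```

With utilization in percent and one-hour intervals, the result is in percent-hours; multiplying by the machine's capacity would turn it into CPU time or MIPS-hours.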


TOP LPAR Leaders/Outsiders Charts

– The system automatically produces the ExtraTime calculation for the last day and records it in the SEDS database.

– This data is used for publishing Leaders/Outsiders bar charts for the last day, last week and last month.

If the server showed a positive ExtraVolume for the previous day, more capacity was used on the server than in the past.

If the server showed a negative ExtraVolume, less capacity was used than usual (which is not necessarily a good thing).
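The ranking behind a Leaders/Outsiders chart can be sketched in Python. Here "leaders" are taken to be the servers with the largest positive ExtraVolume and "outsiders" those with the most negative values, which is an interpretation of the slide. The function name `leaders_and_outsiders`, the `top_n` parameter, and the sample values are hypothetical.

```python
def leaders_and_outsiders(extra_by_server, top_n=3):
    """Split servers into top positive (leaders) and top negative (outsiders)
    ExtraVolume values, each ordered from most to least extreme."""
    ranked = sorted(extra_by_server.items(), key=lambda item: item[1], reverse=True)
    leaders = [(name, value) for name, value in ranked if value > 0][:top_n]
    outsiders = [(name, value) for name, value in reversed(ranked) if value < 0][:top_n]
    return leaders, outsiders


# Hypothetical ExtraVolume totals for the last day, one entry per LPAR.
last_day = {
    "LPAR1": 15.0,
    "LPAR2": -4.0,
    "LPAR3": 0.0,
    "LPAR4": 42.5,
    "LPAR5": -11.0,
}

leaders, outsiders = leaders_and_outsiders(last_day, top_n=2)
print("Leaders:", leaders)
print("Outsiders:", outsiders)
```

The same function could be applied to sums over the last week or the last month to produce the other two charts.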


Summary

• Statistical techniques can be used to automatically detect and report exceptions in resource utilization and service levels.

• The authors’ site previously used MASF techniques to track global and application level CPU, disk and memory exceptions for a large number of UNIX and WINTEL servers.

• The workload level analysis enabled the authors’ site to expand the scope of this process to encompass large mainframe class servers.

• Although the analysis of global exceptions at an LPAR level has limited value for a system that shares workloads across logical systems, a workload-oriented system allows for quick detection of exceptions and immediate drill-down capabilities for the Capacity Planner and Performance Analyst.

• The authors recommend that the reader evaluate and understand any built-in statistical processes within his/her product set and consider developing ways to notify appropriate analysts when exceptions occur.


References

• Trubin, Igor, Ph.D. and McLaughlin, Kevin, "Exception Detection System, Based on the Statistical Process Control Concept," Proceedings of the Computer Measurement Group, 2001

• Trubin, Igor, Ph.D., "Global and Application Level Exception Detection System, Based on the MASF Technique," Proceedings of the Computer Measurement Group, 2002

Thanks!

Igor Trubin

IT Capacity Planning, Capital One Services, Inc.

http://www.cmg.org/measureit/shared/trubin_02.pdf
