
A Performance Anomaly Detection and Analysis Framework for DBMS Development

    Donghun Lee, Sang K. Cha, Member, IEEE, and Arthur H. Lee

Abstract: Detecting performance anomalies and finding their root causes are tedious tasks requiring much manual work. Functionality enhancements in DBMS development, as in most software development, often introduce performance problems in addition to bugs. To detect the problems as soon as they are introduced, which often happens during the early phases of a development cycle, we adopt performance regression testing early in the process. In this paper, we describe a framework that we developed to manage performance anomalies after establishing a set of conditions for a problem to be considered an anomaly. The framework uses Statistical Process Control (SPC) charts to detect performance anomalies and differential profiling to identify their root causes. By automating the tasks within the framework we were able to remove most of the manual overhead in detecting anomalies and reduce the analysis time for identifying the root causes by about 90 percent in most cases. The tools developed and deployed based on the framework allow us continuous, automated daily monitoring of performance in addition to the usual functionality monitoring in our DBMS development.

Index Terms: CUSUM chart, differential profiling, statistical process control (SPC), performance anomaly, DBMS.

    1 INTRODUCTION

Performance concerns in a software product are usually serious enough to cause delays in deployment and are sometimes reported as one of the most critical problems of deployed systems, as documented in [10]. As we have been developing the DBMS engine P*TIME [28], one of the most critical challenges has been fighting occasional performance degradations. With advances in hardware technology and new demands from various database applications, we have seen opportunities for enhancements in DBMS functionalities. With each new release of the system, we try to gain or at least maintain a good performance level in various performance metrics, but we often see new performance challenges as we add functionality enhancements.

Detecting performance problems and identifying their root causes are not simple tasks and are intimately related to the testing aspect of software development. Many traditional testing techniques put their focus toward the end of a development process [13]. This often leads to delayed feedback on potential performance problems, thus increasing the cost to fix them. The longer the interval between two consecutive tests, the more difficult it is to find their root causes. This leads to the idea of conducting tests, including performance tests, as often as the process allows, not only during the latter but also the earlier phases of the system development cycle. The difficulty lies in detecting the anomalies and finding their root causes efficiently with as little disruption to the development process as possible. What is worse, the development continues, thus possibly affecting some of the performance metrics continuously, often in unexpected places and directions. Furthermore, performance values show inevitable variations affected by various factors such as the workload generated by multiple clients, internal thread scheduling, or the test environment. These variations make it difficult to establish a baseline for performance and an error margin to be used to determine whether a change in performance should be recognized as a problem or not. All these point to the need for an automated mechanism built into the development process with an appropriate set of metrics to monitor.

In this paper, we present an automated software development framework by which we can guarantee a sustainable level of performance while developing a software system, e.g., a DBMS. By using the framework one could eliminate almost all manual overhead in performance testing to detect performance anomalies and find their root causes during regression tests. The framework consists of the following components, among others: a GUI-based performance monitor, a Performance Anomaly Detector (PAD), and a differential profiler.

As a key enabling concept for the framework, we use Statistical Process Control (SPC). In a manufacturing process this has been used widely to monitor a specific set of parameters to detect anomalies, and various control charts used for SPC have also been researched for application [7]. We found it feasible to apply SPC to performance monitoring in software development. Among various control charts our study has mostly focused on the feasibility of Cumulative Sum (CUSUM) charts because they are sensitive to small changes in performance. We also present our findings on the issues of applying them to a software development process.


. D. Lee and S.K. Cha are with SAP Labs Korea, Inc., 20-21F, 235 Banpo-Daero, Seocho-gu, Seoul 137-040, Korea. E-mail: {dong.hun.lee, sang.k.cha}@sap.com, [email protected].

. A.H. Lee is with the Department of Mathematics and Computer Science, Claremont McKenna College, 850 Columbia Avenue, Claremont, CA 91711. E-mail: [email protected].

Manuscript received 29 Oct. 2010; revised 8 Feb. 2011; accepted 10 Mar. 2011; published online 28 Mar. 2011. Recommended for acceptance by J. Haritsa. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2010-10-0572. Digital Object Identifier no. 10.1109/TKDE.2011.88.



To reduce the overhead of identifying the root causes of anomalies, we chose to use a profiling approach that utilizes readymade profiling data collected during regression tests. Generating profiling data on demand takes an unacceptable amount of time if we were to generate them after an anomaly has been detected.

Our review of the key performance metrics is now done daily, where it used to be done weekly or monthly before the automated framework was adopted. We have now eliminated most of the manual work (about one person-day) that we used to spend to monitor about 100 metrics in about 10 tests. Generating a report on the suspect changes once an anomaly has been detected now takes only about 2-3 hours as we make use of readymade profiling data, as opposed to about 1.5 days on average in the past.

In Section 2, we present our background project and the overall process that this paper deals with. In Section 3, we describe how we apply SPC charts in finding a performance anomaly. In Section 4, we introduce sampling-based differential profiling to investigate the root causes of an anomaly. We present our framework that embodies these two main approaches, with an account of its effectiveness, in Section 5. The related work is presented in Section 6, and Section 7 concludes the paper with potential future work.

    2 BACKGROUND

    2.1 In-Memory DBMS P*TIME

Since 1998, our academic focus has been on inventing new technologies to support scalable performance based on a new DBMS architecture. P*TIME is a full-fledged in-memory DBMS supporting transactional concurrency control, logging, recovery, SQL 92 with some extensions, a cost-model-based SQL processor, and standard RDBMS APIs. Its storage engine layer incorporates our previous innovations for exploiting engine-level microparallelism using patented new technologies such as differential logging [17] and the Optimistic Latch-Free Index Traversal (OLFIT) concurrency control protocol [29]. It manages performance-critical data and indexes primarily in the memory of a single multithreaded process and supports highly scalable durable-commit update transaction processing performance, highly scalable fast database recovery, and superior multiprocessor scalability by eliminating the well-known index locking bottleneck described by Cha and Song [28].

P*TIME has successfully been in production since 2002 as the mobile stock market database server at Samsung Securities in Korea, one of the world's largest online stock brokerage companies. Since 2003, P*TIME had been deployed in CDMA, broadband, internet trading, banking, and wireless internet fields in Korea, before it was acquired by SAP in September 2005.

The basic concepts of the approaches presented here were developed during P*TIME development and have evolved over the years, and the framework presented in this paper has been successfully merged into the development process of MaxDB [22] and the new in-memory data management platform at SAP announced at SAPPHIRE NOW 2010 [18].

    2.2 Performance Anomaly Management Process

In this section, we present the overall view of our process for monitoring the performance of our system and managing the performance anomalies that we encounter during development. As can be seen in Fig. 1, the overall process consists of packaging and testing the system, detecting anomalies, investigating anomalies for root causes with profiling data, and repairing them.

Generally, one of the main difficulties in providing early feedback on performance anomalies is that an almost fully developed or at least running system is needed for performance testing. To deal with this issue, we adopted an agile development methodology in the development process. To keep the source code operational at all times, we implement a continuous build/test infrastructure. Newly submitted code changes are tested for a set of key performance metrics as well as their functionalities before they are moved to the next consolidation branch.

When a new package is built successfully during a continuous build process, it is saved in a repository and several functionality and performance tests are executed for it. During the performance tests, measurements for several selected key metrics are made. Each target system has a set of key performance metrics that we monitor. For these key metrics various performance tests are run, and it is possible to make measurements for several metrics in a test. Some examples of key metrics follow. To monitor the performance of an OLTP workload we measure the performance of each of the select, insert, update, and delete operations in terms of Transactions Per Second (TPS) by running the simple query statements multiple times. We monitor the execution time for various complex queries for OLAP data. We monitor the execution time for bulk loading using


    Fig. 1. Conceptual diagram of the performance anomaly management process.


massive amounts of data. We also monitor specific metrics that are defined in benchmark tests such as the SD benchmark [32]. In addition to monitoring the performance of these tests we also monitor some hardware-related metrics, such as memory consumption and CPU consumption, by measuring them while running those tests.

When a test finishes, the results are analyzed to see if there is any performance anomaly. If one is found, an alarm report is issued. One of the most common approaches used to investigate a performance anomaly is performance profiling, which allows us to look into the internal behavior of the system. To find the root causes of an anomaly we generally compare the profiling data of the new package that contains the problem with those of the most recent package that is known to be stable. Based on the profiling data we search for the suspect code changes introduced during the interval between the two packages. The suspicious code changes are compiled and reported as the candidate root causes in an analysis report. The developer responsible for the changes that have been flagged as suspects checks his changes and repairs the performance problem.

    3 PERFORMANCE ANOMALY DETECTION

This section presents an automated scheme for detecting performance anomalies. In Section 3.1, we first review some issues that we must deal with as we try to detect anomalies through performance regression testing. In Section 3.2, we discuss the feasibility of applying an SPC chart to our situation. SPC charts have widely been used in manufacturing processes to monitor the consistency of production processes. In Section 3.3, we discuss the design of a CUSUM chart, which is one of the several SPC charts that we actually adopted in our scheme. Finally, in Section 3.4, we describe the implementation issues that we had to address as we applied CUSUM charts to our scheme; we discuss in detail how the target mean value was determined and how a reasonable subgroup size was determined through simulation to be able to detect changes in performance as small as 1.5σ (1.5 times the standard deviation).

    3.1 Issues on Performance Anomaly Detection

We define an unwanted change in performance to be a performance anomaly. The change can be positive or negative, and both cases are considered an anomaly. An unexpected performance improvement means that we did not fully understand the internal behavior of the system at hand; therefore even a positive change needs to be investigated to gain a better understanding of the system. There are three main issues that we must deal with as we try to detect a performance anomaly: 1) how to determine the baseline of performance of the system at hand, 2) once the baseline is determined, how much deviation in performance from the baseline should be considered an anomaly, and 3) how much of what we have to do to repair the anomaly can be automated, minimizing manual intervention in the overall performance testing process.

Fig. 2 shows the results of some of the queries as we execute the TPC-H benchmark [36] at different times for the development branch of our system during the last six months. We run the test multiple times for a package created at a particular point in time. The resulting performance values (samples) of these multiple tests collectively form a subgroup (or a run). That is, a subgroup is a group of measurements obtained by running a test multiple times for the same code in a testing environment.


    Fig. 2. TPC-H performance graph.


Each point in the figure represents a normalized value for a subgroup, obtained by dividing the average of all the samples in the subgroup by the mean for the past six months, so that we can compare the relative distribution of each performance metric. By a mean value we mean the average of the average values of all the subgroups that were created during the time interval.
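As a concrete illustration of this normalization, the following minimal Python sketch (with hypothetical TPS numbers, not data from the paper) averages each subgroup and divides by the mean of the subgroup averages over the period of interest:

```python
# Hypothetical subgroups: each inner list holds the samples measured for one package.
subgroups = [
    [1812.0, 1795.5],
    [1801.2, 1808.4],
    [1525.9, 1533.1],  # a package with a visible performance drop
]

# Average of the samples in each subgroup.
subgroup_means = [sum(s) / len(s) for s in subgroups]

# "Mean value" for the period: the average of the subgroup averages.
period_mean = sum(subgroup_means) / len(subgroup_means)

# Each plotted point is the subgroup average divided by the period mean.
normalized = [m / period_mean for m in subgroup_means]
print([round(n, 3) for n in normalized])
```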

In Fig. 2, we observe a number of noticeable changes in performance. First, [A-1,2,3] show a noticeable change at a certain point in time and maintain the change from that point on. This is due to a change in the mean value because of a certain change in the internals of the system. This case is called a mean shift. Next, [B-1,2,3,4,5,6] show that there were temporary changes and then the values returned to the baseline. Each of these cases means that there was an anomaly but it was fixed. The ones we have described so far are the easy ones. Some are hard to tell whether they are anomalies or not, as can be seen in [C-1,2,3]. Depending on how far back in time we consider in determining the baseline, on what we use as the baseline value, and on how much of a change is large enough to be considered an anomaly, some of [C-1,2,3] could be considered anomalies.

Another thing to take into consideration is the variability inherent in performance. To monitor the performance metrics of a software product it is essential to keep a stable testing environment. This concept is known as configuration management in CMMI [4]. To support such an environment, we need to maintain a set of dedicated servers to keep a stable test environment where all configuration parameters are set consistently. However, regardless of how well or carefully maintained it is, a certain amount of inherent or natural variability always exists.

Because of these issues, identifying a change in performance as an anomaly is not easy at times; we sometimes miss one and only find it out later. Other times we waste resources, time, and human effort by investigating changes that are not serious enough to be considered an anomaly.

To remedy situations like these we need to establish a standard that we can apply to each quality metric and a tool based on the standard to detect anomalies in a systematic manner. Detecting anomalies involves collecting data, charting them, and estimating the effects, all of which take a considerable amount of manual effort. This overhead increases even further as the number of metrics and the number of tests grow. For example, in the case of the TPC-H benchmark there are more than 30 metrics, which include measuring data loading time, executing 22 queries, and calculating a geometric mean value. As we can see in Fig. 2, an individual change in the code could affect the performance of different queries. Therefore, monitoring only the main metric, such as total query execution time or the geometric mean value, is not enough; we should really monitor each individual metric separately. For each metric, we need to gather the history of recent values so that we can chart, analyze, and compare the data and report the result to the developers if necessary. To minimize this sort of overhead, it is necessary to develop an automated tool that can be used to detect anomalies with a minimum amount of human intervention.

    3.2 Applying SPC Charts

Statistical process control has widely been used to detect anomalies by monitoring changes in the values of specific parameters of interest in manufacturing, and various control charts applicable to each domain have been researched [7]. SPC charts are quality control methods that have widely been used to monitor the consistency of production processes. SPC plots the measurements observed over time in a chart with a center line and control limits. When an observation falls on or outside the control limits, the SPC chart is said to detect a change, which implies that there is strong enough evidence that the process has changed [1]. We can also find some papers on applying SPC to the software testing process or the software development process to detect problems in the process [16], [25].

In the case of a manufacturing process it is quite common to see large anomalies for specific parameters at an early stage of building a system due to some elements that have not yet been stabilized. Once the elements that cause these large anomalies are eliminated, we begin to see smaller ones.

Among the several SPC charts, the Shewhart chart [7] is widely used to detect large anomalies that cause variations over 3σ while monitoring the values of a single parameter. For detecting anomalies that are within 1.5σ, the cumulative sum chart is widely used [7].
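For reference, the basic Shewhart check referred to above amounts to comparing a new subgroup mean against limits a few standard deviations from the center line. The Python sketch below is our own illustration (the data and the baseline-window estimation are assumptions, not the paper's implementation):

```python
import statistics

def shewhart_alarm(baseline_means, new_mean, sigma_multiple=3.0):
    """Flag a new subgroup mean that falls outside center line +/- sigma_multiple * sigma,
    with the center line and sigma estimated from a baseline window of subgroup means."""
    center = statistics.mean(baseline_means)
    sigma = statistics.stdev(baseline_means)
    return abs(new_mean - center) > sigma_multiple * sigma

# Illustrative values: a large jump is caught by the 3-sigma limits,
# while a small persistent drift is not (which is where CUSUM helps).
baseline = [100.1, 99.8, 100.3, 99.9, 100.2, 100.0, 99.7, 100.4]
print(shewhart_alarm(baseline, 106.0))  # True: large shift
print(shewhart_alarm(baseline, 100.6))  # False: small shift
```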

The situation is somewhat different when we deal with the performance aspects of software system development. Depending on the nature of the new code that gets introduced to the system during the implementation phase, we see some mix of large anomalies such as [B-4,5] and small ones such as [B-1, C-1] in Fig. 2 in some unpredictable ways. In this section, we show that CUSUM charts are more applicable than Shewhart charts for detecting small anomalies, using some data that we have collected based on this observation. We will then show how we used a combined CUSUM-Shewhart chart to take advantage of the merits of the two.

    3.2.1 Small Mean Shift Detection by CUSUM Chart

In general, the Shewhart chart is known to use only the information about the process contained in the last sample observation, and it ignores any information given by the entire sequence of points [7]. That is, it detects only outliers using 2σ or 3σ limits, but does not see the history or trend. CUSUM charts, however, see all the information of a sequence of sample values and find small mean shifts.

Fig. 3 shows a sequence of 40 subgroups of response times obtained by executing a query in TPC-H, plotted on a Shewhart chart. Each subgroup (or test run) is the average of two test runs for a package built from the development branch at a specific point in time, i.e., the sample size is 2. A larger run ID means that the test was done later in time. Although the figure shows that the execution time for the chosen query is increasing slightly over time, the increase is not large enough to exceed the 3σ control limit for a Shewhart chart.

We applied a CUSUM control chart to the same example shown in Fig. 3 and the result is shown in Fig. 4. In Fig. 4, the CUSUM value is increasing from run ID 35 and eventually



exceeds the upper control limit at run ID 40, thus setting off an alarm. With a CUSUM chart, when the majority of the values happen to be plotted on one side of the center line, it is recognized that a mean shift has occurred, thus setting off an alarm. At run ID 40 the variation is about 3.9 percent of the mean value, which is equivalent to 1.72σ. Even with a variation of less than 2σ, we can see that the CUSUM control chart is sensitive enough to recognize variations of the mean if they occur on one side of the center line.

    3.2.2 Combined CUSUM-Shewhart Procedure

As described in the paragraph before Section 3.2.1, we often see small and large anomalies being introduced as new pieces of code are added to the system.

As we saw in Fig. 4, CUSUM control charts are good for detecting small anomalies but not as effective as Shewhart charts in detecting large shifts [7]. To increase the speed of detecting large mean shifts, a combined CUSUM-Shewhart procedure was proposed by Lucas and Crosier [14]. They showed that a CUSUM chart with Shewhart limits added to the individual measurements can improve the ability to detect larger shifts, and that the Shewhart control limits should be located approximately 3.5 standard deviations from the center line. To detect small anomalies and to speed up the detection of large anomalies we applied the combined CUSUM-Shewhart procedure, and this is also used in the framework that we describe in Section 5.

    3.3 Design of CUSUM Chart

The cumulative sum chart is one of the most sensitive SPC charts for signaling a persistent step change in a parameter [9]. In a CUSUM chart, we calculate the difference between the average for each subgroup and the target mean, and when the cumulative sum of the differences exceeds the control limit, an alarm goes off. It can be expressed as follows [7]:

C_i^+ = max[0, x̄_i − (μ_0 + K) + C_{i−1}^+],
C_i^− = max[0, (μ_0 − K) − x̄_i + C_{i−1}^−],          (3.1)

where the initial values are C_0^+ = C_0^− = 0.

If μ_0 is the target for the process mean and x̄_i is the average of the ith subgroup of samples, then the cumulative sum control chart is formed by plotting the quantity C_i, the cumulative sum up to and including the ith subgroup of samples. C^+ and C^− are the one-sided upper and lower cumulative sums, respectively (see Fig. 4). For the lower CUSUM chart in Fig. 4 the values are represented as negative to make the chart easier to read.

As we apply CUSUM charts, there are two design factors that we need to determine: the reference value K and the decision interval H. If either statistic C^+ or C^− exceeds the decision interval H (the CUSUM control limit), the process is considered to be out of control. When we assume that the standard deviation σ is known or can be estimated, the reference value (or allowance value) K is represented as K = kσ and the decision interval H is represented as H = hσ. In our case, we determined k and h, which determine K and H respectively, as described below, based on previous work by others [7], [8].

The value for a particular metric within a certain time interval would be measured with respect to a mean value μ_0, and at some point later the mean value would change to a new value μ_1, thus causing a mean shift. When a mean shift occurs, the K value is defined as K = |μ_1 − μ_0|/2. In general, when a mean shift as large as 1σ occurs, we regard it as the target for detection as an anomaly and assign 0.5 as the value of k [7].

In determining the decision interval H, it turns out that when k = 0.5 the chart shows good performance with a value of h = 4 or 5, according to Montgomery [7]. In our work, we chose h so that the average run length (ARL0) of the designed CUSUM chart is almost the same as that of a Shewhart chart with a 3σ limit, so that we can compare the anomaly detection characteristics of a Shewhart


    Fig. 3. Shewhart chart for a small mean shift.

Fig. 4. CUSUM control chart for a small mean shift (k = 0.5 and h = 4.77).


chart and a CUSUM chart using the same performance data. The Average Run Length (ARL) is closely related to the performance of an SPC chart; it is the average number of points that must be plotted before a point indicates an out-of-control condition. ARL0 denotes the in-control ARL, that is, the ARL at which a false alarm occurs even though the process is in control. The ARL0 value for a Shewhart control chart with the usual 3σ limits is 370. According to Hawkins [8], to design a CUSUM chart for which ARL0 = 370 with k = 0.5, we should use h = 4.77. All the CUSUM charts that we used in this paper were designed based on the values k = 0.5 and h = 4.77.
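With these design values in hand, the tabular CUSUM of (3.1), together with the 3.5σ Shewhart limit of the combined procedure from Section 3.2.2, is straightforward to compute. The Python sketch below is our own minimal rendering for illustration (the input data are invented), not the implementation used in the framework:

```python
def cusum_alarms(xbar, mu0, sigma, k=0.5, h=4.77, shewhart_multiple=3.5):
    """Tabular CUSUM of (3.1) plus the Shewhart limit on individual subgroup
    averages used by the combined CUSUM-Shewhart procedure (Section 3.2.2).

    xbar  : sequence of subgroup averages x_i
    mu0   : target mean
    sigma : (estimated) standard deviation of the subgroup averages
    """
    K, H = k * sigma, h * sigma
    c_plus = c_minus = 0.0
    alarms = []
    for i, x in enumerate(xbar):
        c_plus = max(0.0, x - (mu0 + K) + c_plus)
        c_minus = max(0.0, (mu0 - K) - x + c_minus)
        cusum_hit = c_plus > H or c_minus > H
        shewhart_hit = abs(x - mu0) > shewhart_multiple * sigma
        if cusum_hit or shewhart_hit:
            alarms.append(i)
    return alarms

# Illustrative use: a persistent drift of under 2 sigma eventually trips the CUSUM.
data = [100.0, 100.2, 99.9, 100.1, 99.8, 101.7, 101.8, 101.6, 101.9, 101.7]
print(cusum_alarms(data, mu0=100.0, sigma=1.0))  # -> [8, 9]
```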

3.4 Issues on CUSUM Control Chart Adoption

We had to consider several issues as we adopted CUSUM charts in our system. We discuss the effects of the sample size first and then how to determine the target mean value. We also discuss a simulation that we used to obtain the most reasonable subgroup size to detect a change as small as a 1.5σ mean shift.

    3.4.1 Sample Size

The sample size for a subgroup can be an issue for the performance of an SPC control chart. A Shewhart chart is affected greatly by the sample size that makes up a subgroup: the larger the sample size, the easier it is to detect even a small mean shift. A CUSUM chart, though, has been shown to work well even when the sample size is equal to one [9].

Although a sample size of one would work well for a CUSUM chart, in our system we decided to execute each test at least twice for each package, thus a sample size of 2. We chose the size two for the reasons described next. If the variability within the samples in a subgroup got larger, it would mean that the variability of the system got larger due to the changes that were introduced recently, thus increasing the instability of the system. This sort of change needs to be detected rigorously and also taken care of, which is possible by measuring and estimating the range of at least several samples.

In general, performance testing is resource intensive and requires several hours of execution time. Considering the limited hardware resources and the need for daily monitoring, we limit the number of tests to two. When the range of the measurements of the two tests gets outside the predetermined range, an alarm goes off on the monitor. As for the range monitoring, we do not go into detail since it is not directly related to the main theme of the paper.

    3.4.2 Moving Average for the Target Mean

CUSUM charts sum up the variation from the target mean μ_0. For a manufacturing process the target mean may be set by the specification or set as the real target value by a system designer. However, this would not work well for a software product.

If the target value for a given performance metric of a software product could be inferred through software modeling, the target mean might be determined in advance. As mentioned by Woodside et al. [24], a modeling approach would be based on many approximations. Because of the approximations involved it would be very difficult to catch subtle changes in performance occurring due to changes in a part of the implementation. For a DBMS a new version is not built from scratch but extended from an already existing code base, mostly in small increments, and yet producing changes in performance big enough in some cases that we should be concerned about them. It is more of an evolutionary change than a large-scale architectural change. It would be very difficult to catch most or all of the changes that are being introduced during the implementation phase by some sort of modeling technique.

As we monitor the performance of a software product, we typically compare the performance of each new package against the performance of a base package selected from a known point in the past. In practice, this is very difficult to do. There may be multiple performance metrics within a single test such as TPC-H. As we gather measurements for multiple performance metrics within a single test, different code changes could affect different performance metrics; different metrics could see performance changes at different points in time. For these reasons it is very difficult to identify a package that can be regarded as the base package for all of the metrics of interest, as can be seen in Fig. 2. Even when we are considering only one performance metric, the performance value could continue to change in small amounts, as can be seen in Fig. 3. When we select a point in the past as the base value, the variations among the future values will be different depending on which past point was selected. When we obtained the Run ID 40 value in Fig. 3, for example, the amount of variation would be different depending on which past point was selected as the baseline. Whether a performance change is to be considered an anomaly or not would depend on which point in the past is selected.

Based on this observation, we decided to use a calculated value, the mean within the period of interest, as is done with Shewhart charts when the target mean is not specified [7]. We maintain a fixed number of subgroups, a window, as we calculate the target mean: when a new package is added to the system, it replaces the oldest one in the window. Within a fixed time period the number of valid subgroups can vary according to the number of successful packages and the number of tests from which we can get valid results. That is why we use a fixed number of subgroups rather than a fixed period of time. When the target mean is recalculated, the mean and standard deviation change as well, thus affecting the reference value K and the decision interval H for the CUSUM chart. In effect we are using a moving window approach with a fixed number of subgroups. In Section 3.4.3, we describe in detail how to determine the window size.

With the moving window approach applied to a CUSUM chart it is possible that a metric measurement, which was considered normal before a new package was added, could all of a sudden become an alarm, albeit a false alarm. This could happen if there are continuous performance gains or losses, even if each gain or loss is small, as new subgroups are added to the window, affecting the mean value consistently on one side (gain or loss) over time. That is, due to the changed target mean caused by the newly added values, the CUSUM values of the old subgroups could be changed



thus possibly making them exceed the limit. For example, Run ID 15 in Fig. 4 shows that the lower CUSUM value (C^−) exceeds the lower limit, setting off an alarm. This happened because the target mean (center line) increased due to the new values beyond Run ID 30, thus changing the CUSUM values for the old measured values and setting off an alarm. To prevent this sort of false alarm, we treat only the alarms that occur at or around a new subgroup as true alarms. For example, if we applied CUSUM charts with the window covering Run IDs 1 through 40 in Fig. 4, we would accept the alarm that occurs at Run ID 40 as a true alarm and the one occurring at Run ID 15 as false.
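A moving-window monitor of this kind could be sketched as follows. This is our own illustrative Python rendering under the assumptions stated in the comments (the class name, window handling, and in-memory deque are ours, not the authors' implementation); it recomputes the target mean, σ, K, and H whenever a subgroup arrives and reports an alarm only when the newest subgroup is out of control:

```python
from collections import deque
import statistics

class MovingWindowCusum:
    """Illustrative moving-window CUSUM monitor (a sketch, not the paper's tool).

    A fixed number of subgroup averages is kept; the target mean, sigma, K, and H
    are recomputed when a new subgroup arrives, and only an out-of-control signal
    at the newest subgroup is reported as a true alarm.
    """

    def __init__(self, window_size=40, k=0.5, h=4.77):
        self.window = deque(maxlen=window_size)
        self.k, self.h = k, h

    def add_subgroup(self, xbar):
        self.window.append(xbar)
        if len(self.window) < 2:
            return False
        mu0 = statistics.mean(self.window)
        sigma = statistics.stdev(self.window)
        if sigma == 0.0:
            return False
        K, H = self.k * sigma, self.h * sigma
        c_plus = c_minus = 0.0
        newest_out_of_control = False
        for x in self.window:  # recompute the CUSUM over the whole window
            c_plus = max(0.0, x - (mu0 + K) + c_plus)
            c_minus = max(0.0, (mu0 - K) - x + c_minus)
            newest_out_of_control = c_plus > H or c_minus > H
        return newest_out_of_control  # only the newest subgroup counts as a true alarm
```

Recomputing the statistic over the whole window each time mirrors the recalculated target mean, K, and H described above.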

    3.4.3 Number of Subgroups in a Window

The number of subgroups within a window, i.e., the window size, is the factor that affects the mean value and the behavior of a CUSUM chart the most.

In this section, we describe how to decide an appropriate window size. It is done through a simulation of a CUSUM control chart using a number of randomly generated data sets.

We generate 20 data sets. Each set consists of 60 randomly generated Gaussian (normally) distributed numbers. Each set has a small mean shift at the 46th of the 60 numbers. Fig. 5 shows one such example of a data set including a 1.5σ mean shift at Run ID 46. In Fig. 5, the first 45 numbers are generated according to the Gaussian distribution with mean 100 and standard deviation 1.0. The subsequent 15 numbers are generated with a shifted mean of about 102 and standard deviation 1.0. During the simulation, we count the number of alarms found by the CUSUM chart with k = 0.5 and h = 4.77.

We assume that each randomly generated number represents the average of the samples in a subgroup. The mean for the data generated after a mean shift has occurred must be adjusted taking the sample size into consideration. That is, with sample size K, if we denote the standard deviation of the subgroup average by σ_x̄ and assume that the subgroup average follows a normal distribution N(100, 1), the standard deviation of the raw sample values of the subgroups can be expressed as σ_x̄ = σ/√K, or σ = σ_x̄·√K [7]. For example, when σ_x̄ is 1 and the sample size is 2, the actual distribution of the sample data should be N(100, √2), and a 1σ mean shift actually corresponds to a √2 mean shift on the raw value scale. In Fig. 5, the first 45 values follow the N(100, 1) distribution and the next 15 values follow the N(100 + 1.5√2, 1) distribution for the 1.5σ mean shift.

Using the method just described, we generated three groups each consisting of 20 data sets, the groups having mean shifts of 1σ, 1.5σ, and 2σ, respectively. For each of the three we simulated a CUSUM chart and identified the Run IDs that exhibit a mean shift. Fig. 6 shows the results of the simulation: for each of the three mean shifts we tried various numbers of subgroups, i.e., window sizes, to see how many subgroups each took to detect the mean shift that occurred at Run ID 46 in Fig. 5.

Fig. 6 shows what percentage of anomalies was detected using various window sizes. In general, the larger the number of subgroups we used in a window, the higher the percentage of anomalies detected. For the 2σ mean shift, we see that all of the anomalies in all 20 data sets were detected with window sizes of 30 or larger. For the 1.5σ mean shift, it took a window size of 40 to detect all of the anomalies. On the other hand, for the 1σ mean shift only 40 percent of the anomalies were detected even with a window size as large as 40.

Fig. 7 shows the CUSUM chart simulation result in more detail for the 20 data sets having a mean shift of 1.5σ. For each of the 20 data sets, we gathered simulation results as we increased the window size in increments of 1 from 20 to 40. For each window size between 20 and 40, we simulated a CUSUM chart with a moving window in such a way that the last Run ID of the window changes from 46 to 60. For example, for window size 20 one of the moving windows would cover Run IDs 30 through 49. In that case we were able to detect a mean shift in one of the 20 data sets, as can be seen in Fig. 7.

On the other hand, for window size 40 we were able to detect a mean shift in all 20 data sets in each of the cases where the last Run ID is 53, 55, and 56. As shown in Fig. 6,


Fig. 5. Randomly generated Gaussian data having a 1.5σ mean shift.

Fig. 6. Anomaly detection success ratio for various mean shifts with varying window sizes (subgroup size).


Fig. 7 shows what percentage of anomalies was detected at each Run ID. The axis labeled Run ID in Fig. 7 means the last Run ID for a given moving window. The line graph for the 1.5σ mean shift in Fig. 6 came from Fig. 7, showing what percentage of the anomalies could be detected for each window size.

Based on the simulation results we determined that the most appropriate window size would be 40. Using 40 as the window size we were able to detect all of the mean shifts for the 20 data sets having mean shifts of 1.5σ and 2σ; and even for the 1σ mean shift we were able to detect the mean shifts in 40 percent of the data sets. Based on our experience with our system, performance values generally show a relatively small standard deviation, at the level of about 1-2 percent of the mean value. To deal effectively with situations like these we determined that a mean shift of 1.5σ would be a good enough limit to use in deciding whether something is an anomaly, which means that in many cases we detect a mean shift of 2-3 percent with respect to the target mean value.

We found something interesting as we analyzed the simulation results. To see if we could detect a 1σ mean shift we used k = 0.5 as we designed the CUSUM charts. When we used window size 40, however, we were able to detect a mean shift in only about 40 percent of the data sets having a 1σ mean shift. This is a rather unexpected phenomenon that can be explained as follows: as we calculate the mean for a moving window, the calculated mean value may be increased due to a mean shift, and the standard deviation value is also increased. As σ increases, both K and H also increase. Due to the increased K, the cumulative sum C^+ (or C^−) will decrease but will not exceed the control limit, which has increased due to the increased H (see (3.1)). The smaller the window size is, the worse this effect gets. Fig. 8 shows a CUSUM chart using window size 20 for the data set shown in Fig. 5. Each line represents a CUSUM chart where the last Run ID in the moving window is 46, 49, 52, and 55, respectively. This is one example where we get no alarm with 20 subgroups in a moving window. As we described above, we can see that the control limit H changes as a function of the standard deviation. For the CUSUM charts that use a window whose last Run ID is 49 or 52, C^+ is larger than the control limit that was calculated at Run ID 46. However, an alarm cannot be issued because the control limit has also increased with the changed standard deviation.

If we increased the window size continuously beyond 40, in theory we could expect better sensitivity for the 1σ mean shift. In reality, however, there are many obstacles to increasing the window size continuously. For example, gathering measurements using 40 subgroups means that we would have to gather data for more than eight weeks of workdays in the case of daily measurement, and even more with an increased window size, bringing in that much more old data. As the code gets revised with continuous development, the older the data, the more the variation, which could come from mean shifts that occurred in the past, thus increasing the standard deviation, which could result in less sensitivity for the CUSUM charts.

The anomaly detection method with CUSUM charts described in Section 3 is implemented as an automated tool in our framework. With the tool we were able to eliminate almost all of the manual work needed to monitor the metrics that we use in various performance tests. This tool is described in detail in Section 5.

4 ROOT CAUSE INVESTIGATION OF AN ANOMALY

Once we find an anomaly using CUSUM charts, we investigate to find its root causes. This section presents our approach to finding the root causes.

    4.1 Readymade Profiling Data

We describe the need for readymade profiling data and some issues we have to deal with in gathering the data. Performance anomalies could result from various causes, and finding their root causes is a complex task requiring


    Fig. 8. CUSUM changes according to moving window with subgroup range 20.

Fig. 7. Detection ratio on each Run ID for the 1.5σ mean shift.


skills. A detected anomaly could be related to some unexpected conditions of the testing environment (such as the server itself or the disk space used for the system) or to a malfunction of the testing tools. If the anomaly is not from one of these causes, it is caused by the internals of the system in question, in which case we rely heavily on the performance profiling data. Comparing the profiling results from an earlier stable package with those from the current package in question can yield valuable clues about the suspect code changes that caused the anomaly.

However, this investigation does not generally start until an anomaly has been detected. This would mean that we would have to install the earlier stable package again and execute the performance tests with a performance profiler. We would also have to run the tests with a profiler for the current package where the anomaly was detected. Profiling done after an anomaly has been detected would require much time and manual overhead. To reduce these measurement costs we always run a profiler during performance regression testing. The profiling data gathered in advance (the readymade data) allow us to start the investigation immediately after an anomaly is detected, making it possible to implement a quick feedback loop to the developers with quantitative analysis results.

There are several difficulties in applying this approach, though. An exhaustive approach, which would gather profiling data for all test cases, would give the best results, but it is not feasible due to an unacceptable operational cost. Confined by the limited time and hardware resources available for the regression testing, we decided to run profiling only for important metrics of a few representative performance tests. That is, we chose the tests that use representative workloads and the metrics of those tests that best expose changes in the internal behavior of the system.

    4.2 Sampling-Based Differential Profiling

The approaches to gathering profiling data can be divided into two categories: sampling-based statistical analysis and instrumentation [34], [39]. Although both have their strengths, we prefer the sampling approach to the instrumentation approach for the following reasons. The instrumentation approach may affect internal behavior such as cache behavior and memory utilization patterns because it requires inserting its own code into the original code to gather profiling data. This kind of unexpected code change could make it difficult to get the exact profiles that we would expect from the original code. As indicated by Reiss [34], the instrumentation approach usually increases the overhead by more than 25 percent and degrades the performance noticeably. The sampling-based approach, on the other hand, does not impact the performance of the system noticeably.

When we investigate an anomaly, it is essential that we compare performance profiles gathered from two different test runs. If we see a decrease in the performance of the system in a test run, we might be able to identify the culprit function(s). For example, if we see a noticeable increase in CPU clock tick consumption by a particular function from one test run to the next, we would infer that some changes in that function are causing the performance drop affecting the throughput of the system. The concept behind this comparison approach is known as differential profiling, introduced by McKenney [27]. He repeated the same performance tests while varying workloads. As the workload changed he was able to identify the functions and modules that responded sensitively to the workload. In [20], Schulz and de Supinski presented a tool (eGprof) that enables users to subtract two performance profiles directly and show the differences as a call-graph visualization.

It is important that we use a profiler that does not affect the performance of the system while the profiler is running. It is also important that we gather profiling measurements in an environment that is close to the real operating environment of the system. To meet these requirements we apply differential profiling based on sampling-based statistical analysis. Since the sampling-based approach does not impact the performance of the system noticeably, we can run the profiler while running performance regression tests.

Through profiling we can gather various information for each function, such as CPU clock tick consumption, number of instructions retired, L2 cache misses, and branch mispredictions. Among them we should determine which metrics are affected most sensitively by changes in the internal system behavior. We measure and monitor the maximum performance of the system under a certain set of testing conditions by maintaining the workload at a sufficient level. If the system is running at its maximum performance at any given time, it means that the processor is being utilized as much as the situation allows. If we detect any performance change compared with the previous test run, it means that there are some changes in CPU utilization by some set of functions in the system. So, when we detect an anomaly, it is most useful to look for changes related to functions showing noticeable differences in clock tick consumption between two profiles.
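A minimal version of this comparison can be expressed in a few lines. The Python sketch below is our own illustration (function names and numbers are hypothetical): it normalizes each function's clock tick consumption by the throughput of its test run and ranks functions by the change between the stable and the suspect profile:

```python
def differential_profile(profile_old, profile_new, throughput_old, throughput_new, top_n=10):
    """Rank functions by the change in normalized clock tick consumption
    (clock ticks divided by throughput) between two sampling profiles."""
    norm_old = {f: ticks / throughput_old for f, ticks in profile_old.items()}
    norm_new = {f: ticks / throughput_new for f, ticks in profile_new.items()}
    funcs = set(norm_old) | set(norm_new)
    deltas = {f: norm_new.get(f, 0.0) - norm_old.get(f, 0.0) for f in funcs}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]

# Hypothetical clock tick totals per function for the stable and the suspect package.
stable = {"Drop_Table": 1.0e9, "Insert_Row": 4.0e9, "Commit": 2.0e9}
suspect = {"Drop_Table": 2.6e9, "Insert_Row": 4.1e9, "Commit": 2.0e9}
for func, delta in differential_profile(stable, suspect, throughput_old=5000, throughput_new=4600):
    print(func, round(delta, 1))
```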

The SD benchmark is one of the most widely used benchmarks within SAP [32]. Fig. 9 shows an example of investigating a performance anomaly using differential profiling during a regression test with the SD benchmark. In this case, an unexpected drop in performance was caused by an upgrade of an internal module in our product. We compared the profiling results of the most recent stable package and the current one with the anomaly for the same benchmark test. We computed the normalized clock tick consumption by dividing the clock tick consumption by the throughput value for each function and sorted the functions according to the amount of change in clock tick consumption. The table in Fig. 9 lists the top 10 functions with the largest increases in normalized clock tick consumption as well as the top 10 with the largest decreases. The table reveals that an updated function Drop_Table in the current package makes up 17 percent of the total change in clock tick consumption in the current profiling runs. The result of this analysis was reported to the developer responsible for the module and the problem was fixed quickly. It turned out that the developer had introduced a new algorithm to optimize the storage management of intermediate results when executing SQL statements. In doing so he did not anticipate the effects of dealing with small result sets while handling diverse workloads. As soon as the updated module was checked into the system, our automated process with SPC charts was able to detect the anomaly and set off an alarm. As he was trying to resolve the



anomaly at hand, he was able to use the profiling data repeatedly until the issue was completely resolved to satisfy the expected system performance standards. This illustrates the benefit of our automated profiling tool, which makes use of the readymade profiling data not only for detecting a performance anomaly, but also for finding its root causes and optimizing the code at development time.

Running the SD benchmark requires multiple machines and takes several hours per run, including test data generation. Using readymade profiling data cuts down the entire testing time considerably; without the readymade profiling data it used to take at least 2 to 3 days to prepare this kind of differential profiling data and investigate an anomaly. Now, with the tool in place, it takes only an average of 2 to 3 hours to detect an anomaly and generate the analysis report needed for the developer.

5 FRAMEWORK FOR PERFORMANCE ANOMALY MANAGEMENT

In this section, we present a new framework of our own that enables automated anomaly detection and efficient investigation of root causes using the approaches described in Sections 3 and 4.

    5.1 Configuration

Performance anomaly detection and root cause investigation invariably rely on the existing infrastructure for packaging and testing, because regression testing requires frequent packaging and testing based on the existing framework. All the changes submitted to the development branch of the code are followed by automated functionality/performance regression testing using the agile development process described in Section 2.2.

Fig. 10 shows a configuration of the framework, the key idea of which is to support early performance feedback. The key components of the framework are isolated inside the box outlined with dotted lines.

All the information related to performance testing, such as database parameters, test parameters, and test results, is stored in a QA database so that it can be used as needed in future testing or be compared with other test results. Some of the performance tests are designed specifically to deal with various workloads, and they produce various


    Fig. 10. Configuration of the framework for an early performance feedback.

    Fig. 9. Example of sampling-based differential profiling results.


result sets and graphical representations for viewing. The QA database is designed to support these demands by including a common schema.

The performance monitor in the framework displays the performance results stored in the QA database in various types of visual charts. It enables us to compare the recent performance results with the ones gathered in the past and can also show the recent performance trends. It is described in detail in Section 5.3. The Performance Anomaly Detector is an implementation of the SPC charts described in Section 3 that detects performance anomalies. Every morning it analyzes the performance test results generated the night before and sends an email report showing whether any anomaly was found. It is described in Section 5.4.

When a performance anomaly is signaled by the PAD, a quality engineer investigates the reasons for the anomaly using the readymade profiling data as described in Section 4. The engineer uses the differential profiler to compare two profiling results using the clock tick consumption information for each function for the two packages being compared. It shows the functions with the largest changes in clock tick consumption. It is described in Section 5.5.

5.2 Common Database Schema to Monitor Several Performance Metrics

Each performance test involves a unique set of metrics, and different metrics use different formats for representing the results as well as different styles for showing the result charts to the user. We had a rather unpleasant experience in the past of having to add a new schema and implement a dedicated performance monitor every time we added a new performance test. To resolve this difficulty and handle diverse performance tests we added a common and open schema to our QA database, which is a part of the framework. Using this integrated common schema we store various types of test results and display them on a single, common performance monitor with several types of charts for viewing the results.
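The paper does not spell out the schema itself. Purely as an illustration of the kind of generic record such a common schema has to support, a sketch in Python might look like the following (all field names are our assumptions):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PerformanceResult:
    """Illustrative generic record for the common QA schema (field names assumed)."""
    test_name: str        # e.g., "TPC-H", "SD benchmark"
    metric_name: str      # e.g., "Q1 execution time", "insert TPS"
    package_version: str  # version of the package under test
    server: str           # dedicated test server the run was executed on
    run_timestamp: datetime
    value: float          # measured value for the metric
    unit: str             # e.g., "ms", "TPS"
```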

    5.3 Performance Monitor

Visualization is an efficient tool for monitoring the trends of various performance metrics. We implemented a Web-based performance monitoring system by which we monitor the results in the QA database for various kinds of tests using only a single unified visualization system.

The results for a test can be summarized and displayed on a single page on the monitor screen. The amount of information displayed on a page is controlled by a set of parameters, including the number of test runs to be used as the default. If a test consists of multiple subtests, the charts for the entire set of subtests are displayed on a page as well. As can be seen in Fig. 11, the pertinent details for a metric are displayed in the form of a table below the chart that shows the information. Users can easily view the overall performance trends of a session at a glance and open it as an enlarged chart for a detailed view. A search (along with


    Fig. 11. Sample view of performance monitor.


filtering) capability using criteria such as user, version, server, etc., is also provided so that a user can search the results and compare them if desired. By employing a Web-based GUI the performance monitor reduces the overhead of manual monitoring of the performance data.

Fig. 11 shows a snapshot during performance monitoring of the problematic case encountered while performing regression testing using the SD benchmark as presented in Section 4.2. The chart at the top of the figure shows the overall recent trend in performance, in which each point represents a value for the performance metric observed over time. A value on the X-axis is the version number of the tested package. Measurements on the right side of the graph are more recent than the ones on the left. A value on the Y-axis is the measured value for the metric, namely, the number of clock ticks per unit of throughput. The four lines in the chart represent the performance results for four different numbers of client connections, i.e., 4K, 8K, 12K, and 16K, respectively. The actual measurements for each metric can be seen in the table below the chart.

    5.4 Performance Anomaly Detector

PAD is a Web-based tool that does the following: for each metric of a given test it retrieves the performance data from the QA database, applies SPC charts to them, and displays the results that are considered suspect anomalies.

Fig. 12 shows a snapshot of an actual PAD result for a performance anomaly found in the SD benchmark test described in Section 4.2. The top part of the figure shows the filters of PAD. The user can specify the test name, version, user, server, and detection rules. In the table in the lower part of the figure the suspect anomalies can be seen. The choices for the rules for detecting anomalies include the Shewhart chart and the CUSUM chart. In PAD, a CUSUM chart means the combined CUSUM-Shewhart chart procedure described in Section 3.2.2.

In the results table we can find detailed statistical information about the suspect anomaly, which includes the mean, variation, and standard deviation. Using the link in the column named Graph in the table we can view the suspect anomaly in the performance monitor. When a quality engineer receives an email report, he can open the performance monitor with the provided link and verify whether it is indeed an anomaly or not. For automated detection and reporting of anomalies PAD is executed at a predetermined time using a set of pre-specified conditions for all of the important metrics of all the performance tests, and the results of running the PAD are communicated to the engineers via email if any suspect anomalies are found.
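Putting the pieces together, the daily PAD run can be pictured as a small orchestration loop. The sketch below is hypothetical (the callables load_window, detect_anomaly, and send_report stand in for the QA database access, the combined CUSUM-Shewhart check, and the e-mail reporting described in the text):

```python
def run_pad(tests, load_window, detect_anomaly, send_report):
    """Hypothetical daily PAD run: for every monitored metric of every test, load the
    recent window of subgroup averages from the QA database, apply the anomaly check,
    and e-mail a report if anything looks suspect."""
    suspects = []
    for test_name, metrics in tests.items():
        for metric in metrics:
            window = load_window(test_name, metric)          # recent subgroup averages
            if len(window) >= 2 and detect_anomaly(window):  # e.g., the moving-window CUSUM above
                suspects.append((test_name, metric, window[-1]))
    if suspects:
        send_report(suspects)                                # report links back to the performance monitor
    return suspects
```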

    5.5 Differential Profiler

The last component of the framework is the differential profiler, which is a standalone program that implements the sampling-based differential profiling described in Section 4. We specify the paths of two profiling results, one for the most recent stable package and the other for the package that has been found to have at least one anomaly, and also enter the two throughput values for the two packages as input parameters to the differential profiler. The profiler then normalizes the clock tick consumption of each function using the throughput value and compares the differences between the two normalized values from both packages. In the results table it displays the list of functions ordered by the differences in such a way that the user can find which functions affected the performance the most. A sample output of the profiler has already been shown in Fig. 9 in Section 4.2.

    5.6 Effectiveness and Limitations

    By employing the framework described above we were able to migrate our weekly or monthly monitoring cycle for several key performance-defining metrics to a daily cycle, which gives developers much faster feedback and therefore reduces the anomaly investigation cost. The reduction in cost comes primarily from automated detection of an anomaly's culprits as well as from narrowing the search space in which the problem lies during the analysis phase using the readymade profiling data.

    Before we employed the framework we used to spend about one person-day each week monitoring about 100 metrics across a dozen performance tests. With the framework incorporated, almost all of that manual overhead has been eliminated. When an anomaly is reported by PAD we now spend only a few minutes to confirm it by opening the link in the GUI-based performance monitor and checking the related charts.

    Fig. 12. Sample report of the performance anomaly detector.

    Performance measurements for our product show stable variation. For more than 70-80 percent of the performance metrics the standard deviation is only 1-2 percent of the target mean value; for the remaining metrics the standard deviation is at most 4-7 percent of the target mean. If the test environment is stable and the variation of the measured metrics is stable, it is reasonable to expect to detect a mean shift of about 1.5σ, and CUSUM charts are well suited for finding such anomalies. However, if the performance of the product is not stable or the variation is very high, CUSUM charts can produce many false alarms. In that case, we should first find and remove the factors that cause the large variation, whether they come from the internals of the system or from the environment, to reduce the false alarms and the management overhead. Until the process and the product become stable, Shewhart charts are better suited because they are more useful for finding large anomalies [7].

    Once an anomaly has been detected, investigating it to find the culprit change(s) and issuing an analysis report takes anywhere from several hours to several days depending on the test and the complexity of the problem. In our system, the average reporting time used to be about 1.5 days; after applying the framework with the readymade profiling data, however, we were able to reduce the investigation and reporting time to an average of 2-3 hours.

    Fig. 13 shows the distribution of root causes for the anomalies found using the framework: 32 percent of the root causes were unexpected side effects introduced while revising the code, mainly in the lock module, the query evaluator, or result set management; 27 percent came from design issues, i.e., side effects of new modules (it is not easy to assess the performance effect of a newly added module at design time; generally, performance issues are found after the module is merged into the system); 14 percent were due to bugs, mainly in the query optimizer generating nonoptimal query plans; 9 percent came from careless coding such as misuse of internal threads or leaving the trace log on by mistake after debugging was done; and the remaining 18 percent came from miscellaneous issues such as default configuration parameter changes.

    Fig. 14 shows the contribution of sampling-based preprofiling to the problems that we encountered during the past year. For 62 percent of the anomalies found, the quality engineers could easily find the root causes using the results from the differential profiler. For 19 percent of them the quality engineers could not resolve the problems themselves, but the developers were able to find the root causes using the profiling data without conducting any further experiments. For 13 percent of the cases, we could not resolve the problem by sampling-based preprofiling due to the complexity of the anomalies and required more experiments to find the causes. In the remaining 6 percent, the anomalies were found in tests to which we did not apply preprofiling; for those cases we had to run profiling after the anomalies were found and were then able to analyze and find the root causes.

    Not every problem can easily be resolved by the sampling-based differential profiling approach. As we compare the profiling data for a detected anomaly, it is easy to find the root cause if the changes in clock tick consumption are concentrated in a small number of contributing functions. However, if the changes in clock tick consumption are spread thinly over many functions, it is hard to identify the culprit function(s) with this approach. In that case, we can use callgraph analysis, which uses the caller-callee relationship of each function to trace the changes in clock tick consumption between them. In general, callgraph analysis tools use the instrumentation approach, although recently we have begun to see sampling-based tools that allow callgraph analysis, such as the Intel Performance Tuning Utility (PTU) [40] and SunStudio [35]. In fact, we are already applying differential profiling using these sampling-based callgraph tools within the presented framework. However, in this paper we focus on the existing framework, which uses preprofiling to reduce the investigation cost. We plan to discuss the utilization of callgraph analysis in a separate paper in the near future.

    6 RELATED WORK

    During the past decade software performance has become a hot issue in software engineering, and many researchers and practitioners have written about performance testing methodologies. In [30] and [31], Barber presents the state of industrial performance measurement and testing techniques. Recently, these software engineering approaches and analysis methods have been discussed under the Software Performance Engineering (SPE) umbrella. In [24], Woodside et al. introduce the current status of SPE and present open issues. They categorize the overall related efforts into two distinct approaches: an early-cycle predictive model-based approach [10], [13], [33] and a late-cycle measurement-based approach [5], [30], [31].

    Fig. 13. Root cause distribution of performance anomalies.

    Fig. 14. Contributions of preprofiling in finding the root cause of an anomaly.

    This paper presents a framework by which we can easily detect the existence of a performance anomaly introduced into a software system during development and investigate its root causes. We apply the late-cycle measurement-based approach mentioned in [24]. Our research is closely related to one of the future goals mentioned in [24], i.e., better methods and tools for interpreting performance results and diagnosing performance problems.

    Unfortunately, there is almost no published research on a framework for performance anomaly management. The work in [5] is the closest that we could find, where Thakkar et al. present a framework for measurement-based performance modeling. Its conceptual diagram looks similar to ours, but its purpose and contents are entirely different: they build a performance model from measured performance results for future use such as capacity planning on different platforms.

    In recent years researchers have called for a new paradigm for DBMS architecture to meet different kinds of requirements and a higher level of performance [23], [19]. To better utilize the ever-evolving computer hardware architecture and meet the demands of new data stores, DBMS engines constantly seek to enhance their functionality and adopt new techniques. While DBMS functionality has advanced significantly in recent years, the methodology for testing the engines has not kept up with the pace of system development. Only recently has DBMS testing gained some attention in the database community [6], [15], [26]. In [11] and [12], Haftmann et al. suggest a framework for efficient regression testing of database applications. Research efforts on the performance of a DBMS, however, have mainly focused on the DBMS engine itself in the form of self-tuning or automatic performance management as a new functionality of the DBMS [3]. To the best of our knowledge there has not been any published work on a performance management framework or on the adoption of statistical process control to detect performance anomalies during DBMS development.

    Several control charts have been developed for SPC. In [7], Montgomery introduces the basic concept of each control chart and explains its details. SPC control charts have been used mainly in the manufacturing field to monitor production processes. In [7], the following two prerequisites are suggested for applying SPC in other industries: 1) the data generated by the process when it is in control are normally and independently distributed with a mean and standard deviation that are both considered fixed and unknown; and 2) any correlation over time will keep control charts from working well (independence of the observations). Recently, we can find research efforts to apply SPC to other fields such as energy use and estrous detection [38], [1]. Some research has also been done in software engineering on applying SPC to improve software development processes. In [25], Komuro presents experiences of applying SPC techniques to a software development process with several real-world examples, especially in bug rate management and peer review processes. They argue that SPC is useful for improving a software development process and that we should be more process-centric rather than product-centric when applying SPC. In [16], Cangussu et al. present a variant of SPC based on a logarithmic transformation to control a software test process using key quality factors such as code coverage, number of remaining errors, and failure intensity, because these factors exhibit an exponential behavior and traditional SPC methods are not suitable for them.

    Anomaly detection has been studied in various application areas, as described by Chandola et al. [37]. Each technique is based on its own model with its own set of assumptions. In monitoring performance, we observe changes in the values of a metric over time and thus assume that normal data follow a known distribution, making it possible to adopt a single-dimensional model. We could apply approaches used in data mining such as the nearest neighbor-based approach, which examines spatial proximity within a data space to find anomalies. However, to find performance anomalies it seems more reasonable to apply a statistical approach such as SPC charts, because their assumptions and model are well suited to our purpose and incur less computational overhead than other approaches. As described in Section 2.2, we apply SPC charts to various metrics in various performance tests. Our framework is metric-independent as long as the measured values for a given metric exhibit a normal distribution, because SPC charts are based on a statistical model.
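    As an aside, one lightweight way to screen whether a metric satisfies this normality assumption before charting it is sketched below; the Shapiro-Wilk test, the significance level, and the minimum sample size are illustrative choices and not part of the framework as described.

```python
from scipy import stats


def metric_suitable_for_spc(values, alpha=0.05, min_samples=8):
    """Return True if recent in-control measurements look normally distributed.

    Uses the Shapiro-Wilk test; if normality is not rejected at the given
    significance level, the metric is a reasonable candidate for SPC charting.
    """
    if len(values) < min_samples:       # too few points to judge reliably
        return False
    _, p_value = stats.shapiro(values)
    return p_value >= alpha
```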

    Performance profiling is already widely used to investigate software performance problems. In [34], Reiss introduces several tools that can support performance profiling and presents a methodology and framework named DYPER, with which profiling data can be obtained within a user-defined overhead using sampling and instrumentation approaches. As for the approach of comparing two profiling results that we apply in the presented framework, there have been similar efforts such as those described in [27] and [20], as discussed in Section 4.2. In our framework, however, the main focus is on utilizing sampling-based statistical analysis to obtain the profiling data without any noticeable performance interference.

    7 CONCLUSION AND FUTURE WORK

    In this paper, we described a framework that we developed to manage performance anomalies for a real-world DBMS project at SAP involving several hundred changes a day introduced by multiple developers. The framework is based on a cost-minimizing approach by which we detect anomalies automatically and find their root causes using readymade data.

    For the approach, we applied the concept of statistical process control to a software product to monitor performance metrics. We used measured performance results to show that both large and small anomalies are present in the system, and proposed that a combined CUSUM-Shewhart chart be used in such a situation. We could detect anomalies with a mean shift as small as 1.5σ using CUSUM charts and detailed the design issues to consider in applying the charts. In determining target means we used a moving average of a specific number of subgroups and proposed an optimal subgroup size through simulation. By employing such an automated anomaly detection system we were able to show a substantial reduction in the manual cost that we used to incur in monitoring the performance-related metrics.


    We used a sampling-based differential profiling approach to investigate the root causes of an anomaly. Preparing the profiling data needed for the approach only after an anomaly has been detected would take too much time to be practical. To minimize the time it takes for root cause investigations, we generate profiling data in advance while performing regression tests. Once an anomaly has been detected, we compare the profiling data from the most recent stable package with those from the package with the anomaly. By using such readymade profiling data we were able to reduce the cost of anomaly investigation by as much as 90 percent.

    The framework that we developed and applied to DBMS development is general enough to be applicable to any software development in which new versions with continuous functionality enhancements are introduced during development, thus requiring continuous monitoring of a certain set of performance metrics.

    In the future, we hope to expand the scope of automation, e.g., by automating the identification of the culprit changes (root causes) based on detailed analysis of the profiling data, including the generation of reports of possible culprits, and we hope to find other areas of software development where the framework can be applied.

    REFERENCES

    [1] A. de Vries and B.J. Conlin, "Design and Performance of Statistical Process Control Charts Applied to Estrous Detection Efficiency," J. Dairy Science, vol. 86, pp. 1970-1984, 2003.
    [2] A. Avritzer, J. Kondek, D. Liu, and E.J. Weyuker, "Software Performance Testing Based on Workload Characterization," Proc. Third Int'l Workshop Software and Performance (WOSP '02), pp. 17-24, 2002, doi: 10.1145/584369.584373.
    [3] A. Thiem and K.-U. Sattler, "An Integrated Approach to Performance Monitoring for Autonomous Tuning," Proc. IEEE Int'l Conf. Data Eng. (ICDE '09), pp. 1671-1678, 2009, doi: 10.1109/ICDE.2009.142.
    [4] CMMI, http://www.sei.cmu.edu/cmmi/, 2012.
    [5] D. Thakkar, A.E. Hassan, G. Hamann, and P. Flora, "A Framework for Measurement Based Performance Modeling," Proc. Seventh Int'l Workshop Software and Performance (WOSP '08), pp. 55-66, 2008, doi: 10.1145/1383559.1383567.
    [6] D.R. Slutz, "Massive Stochastic Testing of SQL," Proc. 24th Int'l Conf. Very Large Data Bases (VLDB '98), pp. 618-622, 1998.
    [7] D.C. Montgomery, Introduction to Statistical Quality Control, fifth ed. John Wiley & Sons, 2005.
    [8] D.M. Hawkins, "Cumulative Sum Control Charting: An Underutilized SPC Tool," Quality Eng., vol. 5, no. 3, pp. 463-477, 1993, doi: 10.1080/08982119308918986.
    [9] D.M. Hawkins and D.H. Olwell, Cumulative Sum Charts and Charting for Quality Improvement. Springer Verlag, 1998.
    [10] E.J. Weyuker and F.I. Vokolos, "Experience with Performance Testing of Software Systems: Issues, an Approach, and Case Study," IEEE Trans. Software Eng., vol. 26, no. 12, pp. 1147-1156, Dec. 2000, doi: 10.1109/32.888628.
    [11] F. Haftmann, D. Kossmann, and E. Lo, "A Framework for Efficient Regression Tests on Database Applications," The Int'l J. Very Large Data Bases, vol. 16, no. 1, pp. 145-164, 2007, doi: 10.1007/s00778-006-0028-8.
    [12] F. Haftmann, D. Kossmann, and E. Lo, "Parallel Execution of Test Runs for Database Application Systems," Proc. 31st Int'l Conf. Very Large Data Bases, pp. 589-600, 2005.
    [13] G. Denaro, A. Polini, and W. Emmerich, "Early Performance Testing of Distributed Software Applications," Proc. Fourth Int'l Workshop Software and Performance (WOSP '04), pp. 94-103, 2004, doi: 10.1145/974044.974059.
    [14] J.M. Lucas and R.B. Crosier, "Combined Shewhart-CUSUM Quality Control Schemes," J. Quality Technology, vol. 14, no. 2, pp. 51-59, 1982.
    [15] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P.J. Weinberger, "Quickly Generating Billion-Record Synthetic Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, 1994, doi: 10.1145/191839.191886.
    [16] J.W. Cangussu, R.A. DeCarlo, and A.P. Mathur, "Monitoring the Software Test Process Using Statistical Process Control: A Logarithmic Approach," ACM SIGSOFT Software Eng. Notes, vol. 28, no. 5, pp. 158-167, 2003, doi: 10.1145/940071.940093.
    [17] J. Lee, K. Kim, and S.K. Cha, "Differential Logging: A Commutative and Associative Logging Scheme for Highly Parallel Main Memory Database," Proc. 17th Int'l Conf. Data Eng. (ICDE '01), p. 173, 2001, doi: 10.1109/ICDE.2001.914826.
    [18] H. Plattner, "In-Memory Data Management Platform in SAP," SAPPHIRE NOW, http://www.youtube.com/watch?v=iUUH_HOs7DI, 2010.
    [19] H. Plattner, "A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database," Proc. 35th SIGMOD Int'l Conf. Management of Data (SIGMOD '09), June/July 2009.
    [20] M. Schulz and B.R. de Supinski, "Practical Differential Profiling," Proc. Euro-Par, pp. 97-106, 2007.
    [21] M. Hauswirth, P.F. Sweeney, A. Diwan, and M. Hind, "Vertical Profiling: Understanding the Behavior of Object-Oriented Applications," Proc. 19th Ann. ACM SIGPLAN Conf. Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA '04), pp. 251-269, 2004, doi: 10.1145/1028976.1028998.
    [22] MaxDB, https://www.sdn.sap.com/irj/sdn/maxdb, 2012.
    [23] M. Stonebraker, S. Madden, D.J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland, "The End of an Architectural Era: (It's Time for a Complete Rewrite)," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 1150-1160, 2007.
    [24] M. Woodside, G. Franks, and D.C. Petriu, "The Future of Software Performance Engineering," Proc. Future of Software Eng., pp. 171-187, 2007, doi: 10.1109/FOSE.2007.32.
    [25] M. Komuro, "Experiences of Applying SPC Techniques to Software Development Processes," Proc. 28th Int'l Conf. Software Eng. (ICSE '06), pp. 577-584, 2006.
    [26] N. Bruno, S. Chaudhuri, and D. Thomas, "Generating Queries with Cardinality Constraints for DBMS Testing," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 12, pp. 1721-1725, Dec. 2006, doi: 10.1109/TKDE.2006.190.
    [27] P.E. McKenney, "Differential Profiling," Proc. Third Int'l Workshop Modeling (MASCOTS '95), pp. 237-241, 1995, doi: 10.1109/MASCOT.1995.378681.
    [28] S.K. Cha and C. Song, "P*TIME: Highly Scalable OLTP DBMS for Managing Update-Intensive Stream Workload," Proc. 30th Int'l Conf. Very Large Data Bases (VLDB '04), pp. 1033-1044, 2004.
    [29] S.K. Cha, S. Hwang, K. Kim, and K. Kwon, "Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems," Proc. 27th Int'l Conf. Very Large Data Bases (VLDB '01), pp. 181-190, 2001.
    [30] S. Barber, "Beyond Performance Testing," http://www-128.ibm.com/developerworks/rational/library/4169.html, 2012.
    [31] S. Barber, http://www.logigear.com/newsletter/explanation_of_performance_testing_on_an_agile_team-part-1.asp, 2010.
    [32] SD Benchmark, http://www.sap.com/solutions/benchmark/sd.epx, 2012.
    [33] S. Balsamo and A. Di Marco, "Model-Based Performance Prediction in Software Development: A Survey," IEEE Trans. Software Eng., vol. 30, no. 5, pp. 295-310, May 2004.
    [34] S.P. Reiss, "Controlled Dynamic Performance Analysis," Proc. Seventh Int'l Workshop Software and Performance (WOSP '08), pp. 43-54, 2008, doi: 10.1145/1383559.1383566.
    [35] SunStudio, http://developers.sun.com/sunstudio/index.jsp, 2012.
    [36] TPC-H, http://www.tpc.org/tpch, 2012.
    [37] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, vol. 41, no. 3, pp. 1-58, July 2009.
    [38] V.S. Puranik, "CUSUM Quality Control Chart for Monitoring Energy Use Performance," Proc. IEEE Int'l Conf. Industrial Eng. and Eng. Management, pp. 1231-1235, 2007, doi: 10.1109/IEEM.2007.4419388.
    [39] VTune, http://www.intel.com/cd/software/products/asmo-na/eng/vtune/239144.htm, 2010.
    [40] PTU, http://software.intel.com/en-us/articles/intel-performance-tuning-utility/, 2012.


    Donghun Lee received the BS and MS degrees in electrical engineering from Yonsei University in 1993 and 1995, respectively, and the PhD degree from the School of Electrical Engineering and Computer Science at Seoul National University in 2011. From 1995 to 2001, he was a software engineer at Samsung Electronics. From 2002 to 2005, he was a software engineer and project manager at Transact In Memory, Inc., working on an in-memory DBMS called P*TIME. He is currently working for SAP, which acquired Transact In Memory, Inc., in 2005, as a development manager on a new in-memory data management platform project. His research interests include database systems, software performance engineering, and quality control and system management.

    Sang K. Cha received the BS and MS degrees in electrical engineering at Seoul National University in 1980 and 1982, respectively, and the PhD degree in database systems at Stanford University in 1991. Currently, he is a full professor in EECS at Seoul National University, which he joined in 1992. He founded Transact In Memory, Inc., in 2000 to develop a next-generation in-memory DBMS called P*TIME and led it to a successful acquisition by SAP in 2005. Since then, he has continued to advance P*TIME and recently played a key role in building SAP's newly announced in-memory platform for integrated real-time analytics and transaction processing. He is currently a VLDB Journal editor and previously served as a VLDB 2006 Industrial Program Committee cochair and an IEEE ICDE 2006 Program Committee Area vice-chair in DBMS Internals and Performance. His current research interests include next-generation in-memory database engines, massively parallel cloud data management, business process data management, and performance-oriented mission-critical software development management. He is a member of the IEEE.

    Arthur H. Lee received the BS degree from the University of Utah, the MS degree from Stanford University, and the PhD degree from the University of Utah, all in computer science, in 1982, 1987, and 1992, respectively. Currently, he is the W.M. Keck associate professor of computer science at Claremont McKenna College, which he joined in 2005. He was an associate professor at the University of Utah from 2001 to 2004 and an assistant and associate professor at Korea University from 1993 to 2000. Before that, he worked for about 11 years combined at Sandia National Labs as an MTS, Xerox Palo Alto Research Center as a research staff member, and Evans & Sutherland Computer Corp. as a senior software engineer while attending school pursuing the CS degrees. His current research interests include programming languages, database systems, and software engineering. He served as the editor-in-chief of the Journal of the Korea Computer Graphics Society from 1994 to 1997 and has served on program committees and/or as a reviewer for many conferences.
