LEARNING-AIDED SYSTEM PERFORMANCE MODELING IN SUPPORT OF SELF-OPTIMIZED RESOURCE SCHEDULING IN

DISTRIBUTED ENVIRONMENTS

By

JIAN ZHANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

© 2007 Jian Zhang

To my family.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Professor Renato J.

Figueiredo, for his invaluable advice, encouragement, and support. This dissertation would

not have been possible without his guidance and support. My deep appreciation goes to

Professor Jose A.B. Fortes for participating in my supervisory committee and for all the

guidance and opportunities to work in the In-VIGO team that he gave me during my

Ph.D. study. My deep recognition also goes to Professor Malay Ghosh and Professor Alan

George for serving on my supervisory committee and for their valuable suggestions. Many

thanks go to Dr. Mazin Yousif and Mr. Robert Carpenter from Intel Corporation for their

valuable input and generous funding for this research. Thanks also go to my colleagues

in the Advanced Computing Information Systems (ACIS) Laboratory for their discussion

of ideas and years of friendship. Last but not least, I owe a special debt of gratitude to

my family. Without their selfless love and support, I cannot imagine what I would have

achieved.

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.1 Resource Performance Modeling . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Autonomic Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    1.3.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
    1.3.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 18
    1.3.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 18
    1.3.4 Other Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Virtual Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
    1.4.1 Virtual Machine Characteristics . . . . . . . . . . . . . . . . . . . . 20
    1.4.2 Virtual Machine Plant . . . . . . . . . . . . . . . . . . . . . . . . . 22

2 APPLICATION CLASSIFICATION BASED ON MONITORING AND LEARNING OF RESOURCE CONSUMPTION PATTERNS . . . . . . . . . . . . . . . . . . . . 24

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
    2.2.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . 27
    2.2.2 k-Nearest Neighbor Algorithm . . . . . . . . . . . . . . . . . . . . . 30
2.3 Application Classification Framework . . . . . . . . . . . . . . . . . . . . . 31
    2.3.1 Performance Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . 32
    2.3.2 Classification Center . . . . . . . . . . . . . . . . . . . . . . . . . . 33
        2.3.2.1 Data preprocessing based on expert knowledge . . . . . . . . . 33
        2.3.2.2 Feature selection based on principal component analysis . . . . 34
        2.3.2.3 Training and classification . . . . . . . . . . . . . . . . . . . . 35
    2.3.3 Post Processing and Application Database . . . . . . . . . . . . . . . 35
2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
    2.4.1 Classification Ability . . . . . . . . . . . . . . . . . . . . . . . . . . 36
    2.4.2 Scheduling Performance Improvement . . . . . . . . . . . . . . . . . 41
    2.4.3 Classification Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3 AUTONOMIC FEATURE SELECTION FOR APPLICATION CLASSIFICATION . . . 49

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
    3.2.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
    3.2.2 Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
    3.2.3 Mahalanobis Distance . . . . . . . . . . . . . . . . . . . . . . . . . . 55
    3.2.4 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Autonomic Feature Selection Framework . . . . . . . . . . . . . . . . . . . . 56
    3.3.1 Data Quality Assuror . . . . . . . . . . . . . . . . . . . . . . . . . . 56
    3.3.2 Feature Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
    3.3.3 Trainer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
    3.4.1 Feature Selection and Classification Accuracy . . . . . . . . . . . . . 62
    3.4.2 Classification Validation . . . . . . . . . . . . . . . . . . . . . . . . 65
    3.4.3 Training Data Quality Assurance . . . . . . . . . . . . . . . . . . . . 71
3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4 ADAPTIVE PREDICTOR INTEGRATION FOR SYSTEM PERFORMANCE PREDICTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Virtual Machine Resource Prediction Overview . . . . . . . . . . . . . . . . 77
4.4 Time Series Models for Resource Performance Prediction . . . . . . . . . . . 80
4.5 Algorithms for Prediction Model Selection . . . . . . . . . . . . . . . . . . . 82
    4.5.1 k-Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
    4.5.2 Bayesian Classification . . . . . . . . . . . . . . . . . . . . . . . . . 83
    4.5.3 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . 85
4.6 Learning-Aided Adaptive Resource Predictor . . . . . . . . . . . . . . . . . . 86
    4.6.1 Training Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
    4.6.2 Testing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.7 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
    4.7.1 Best Predictor Selection . . . . . . . . . . . . . . . . . . . . . . . . 90
    4.7.2 Virtual Machine Performance Trace Prediction . . . . . . . . . . . . 91
        4.7.2.1 Performance of k-NN based LARPredictor . . . . . . . . . . . . 92
        4.7.2.2 Performance comparison of k-NN and Bayesian-classifier based LARPredictor . . . 96
        4.7.2.3 Performance comparison of the LARPredictors and the cumulative-MSE based predictor used in the NWS . . . 97
    4.7.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5 APPLICATION RESOURCE DEMAND PHASE ANALYSIS AND PREDICTIONS . . . 106

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Application Resource Demand Phase Analysis and Prediction Prototype . . . 108
5.3 Data Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
    5.3.1 Stages in Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
    5.3.2 Definitions and Notation . . . . . . . . . . . . . . . . . . . . . . . . 112
    5.3.3 k-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
    5.3.4 Finding the Optimal Number of Clusters . . . . . . . . . . . . . . . . 114
5.4 Phase Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
    5.5.1 Phase Behavior Analysis . . . . . . . . . . . . . . . . . . . . . . . . 119
        5.5.1.1 SPECseis96 benchmark . . . . . . . . . . . . . . . . . . . . . . 119
        5.5.1.2 World Cup web log replay . . . . . . . . . . . . . . . . . . . . 122
    5.5.2 Phase Prediction Accuracy . . . . . . . . . . . . . . . . . . . . . . . 123
    5.5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

LIST OF TABLES

Table page

2-1 Performance metric list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2-2 List of training and testing applications . . . . . . . . . . . . . . . . . . . . . . . 37

2-3 Experimental data: application class compositions . . . . . . . . . . . . . . . . . 40

2-4 System throughput: concurrent vs. sequential executions . . . . . . . . . . . . . 44

3-1 Sample confusion matrix with two classes (L=2) . . . . . . . . . . . . . . . . . . 56

3-2 Sample performance metrics in the original feature set . . . . . . . . . . . . . . 59

3-3 Confusion matrix of classification results . . . . . . . . . . . . . . . . . . . . . . 65

3-4 Performance metric correlation matrixes of test applications . . . . . . . . . . . 70

4-1 Normalized prediction MSE statistics for resources of VM1 . . . . . . . . . . . . 96

4-2 Normalized prediction MSE statistics for resources of VM2 . . . . . . . . . . . . 97

4-3 Normalized prediction MSE statistics for resources of VM3 . . . . . . . . . . . . 98

4-4 Normalized prediction MSE statistics for resources of VM4 . . . . . . . . . . . . 99

4-5 Normalized prediction MSE statistics for resources of VM5 . . . . . . . . . . . . 99

4-6 Best predictors of all the trace data . . . . . . . . . . . . . . . . . . . . . . . . . 100

5-1 Performance feature list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5-2 SPECseis96 total cost ratio schedule for the eight performance features . . . . . 122

5-3 Average phase prediction accuracy . . . . . . . . . . . . . . . . . . . . . . . . . 124

5-4 Performance feature list of VM traces . . . . . . . . . . . . . . . . . . . . . . . . 124

5-5 Average phase prediction accuracy of the five VMs . . . . . . . . . . . . . . . . 126

LIST OF FIGURES

Figure page

1-1 Structure of an autonomic element. . . . . . . . . . . . . . . . . . . . . . . . . . 16

1-2 Classification system representation . . . . . . . . . . . . . . . . . . . . . . . . . 19

1-3 Virtual machine structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1-4 VMPlant architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2-1 Sample of principal component analysis . . . . . . . . . . . . . . . . . . . . . . . 28

2-2 k-nearest neighbor classification example . . . . . . . . . . . . . . . . . . . . . . 31

2-3 Application classification model . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2-4 Performance feature space dimension reductions in the application classification process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2-5 Sample clustering diagrams of application classifications . . . . . . . . . . . . . 39

2-6 Application class composition diagram . . . . . . . . . . . . . . . . . . . . . . . 42

2-7 System throughput comparisons for ten different schedules . . . . . . . . . . . . 43

2-8 Application throughput comparisons of different schedules . . . . . . . . . . . . 44

3-1 Sample Bayesian network generated by feature selector . . . . . . . . . . . . . . 54

3-2 Feature selection model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3-3 Bayesian-network based feature selection algorithm for application classification 60

3-4 Average classification accuracy of 10 sets of test data versus number of features selected in the first experiment . . . . . . . . . . . . . . . . . . . . . . . . 63

3-5 Two-class test data distribution with the first two selected features . . . . . . . 63

3-6 Five-class test data distribution with first two selected features . . . . . . . . . . 66

3-7 Comparison of distances between cluster centers derived from expert-selected and automatically selected feature sets . . . . . . . . . . . . . . . . . . . . . 66

3-8 Training data clustering diagram derived from expert-selected and automatically selected feature sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3-9 Classification results of benchmark programs . . . . . . . . . . . . . . . . . . . . 69

4-1 Virtual machine resource usage prediction prototype . . . . . . . . . . . . . . . 78

4-2 Sample XML schema of the VM performance DB . . . . . . . . . . . . . . . . . 80

4-3 Learning-aided adaptive resource predictor workflow . . . . . . . . . . . . . . . 87

4-4 Learning-aided adaptive resource predictor dataflow . . . . . . . . . . . . . . . . 88

4-5 Best predictor selection for trace VM2 load15 . . . . . . . . . . . . . . . . . . . 92

4-6 Best predictor selection for trace VM2 PktIn . . . . . . . . . . . . . . . . . . . . 93

4-7 Best predictor selection for trace VM2 Swap . . . . . . . . . . . . . . . . . . . . 94

4-8 Best predictor selection for trace VM2 Disk . . . . . . . . . . . . . . . . . . . . 95

4-9 Predictor performance comparison (VM1) . . . . . . . . . . . . . . . . . . . . . 101

4-10 Predictor performance comparison (VM2) . . . . . . . . . . . . . . . . . . . . . 102

4-11 Predictor performance comparison (VM3) . . . . . . . . . . . . . . . . . . . . . 103

4-12 Predictor performance comparison (VM4) . . . . . . . . . . . . . . . . . . . . . 104

4-13 Predictor performance comparison (VM5) . . . . . . . . . . . . . . . . . . . . . 105

5-1 Application resource demand phase analysis and prediction prototype . . . . . . 109

5-2 Resource allocation strategy comparison . . . . . . . . . . . . . . . . . . . . . . 115

5-3 Application resource demand phase prediction workflow . . . . . . . . . . . . . . 129

5-4 Phase analysis of SPECseis96 CPU user . . . . . . . . . . . . . . . . . . . . . . 130

5-5 Phase analysis of WorldCup’98 Bytes In . . . . . . . . . . . . . . . . . . . . . . 133

5-6 Phase analysis of WorldCup’98 Bytes out . . . . . . . . . . . . . . . . . . . . . . 134

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LEARNING-AIDED SYSTEM PERFORMANCE MODELING IN SUPPORT OF SELF-OPTIMIZED RESOURCE SCHEDULING IN

DISTRIBUTED ENVIRONMENTS

By

Jian Zhang

December 2007

Chair: Renato J. Figueiredo
Major: Electrical and Computer Engineering

With the goal of autonomic computing, it is desirable to have a resource scheduler

that is capable of self-optimization, which means that with a given high-level objective the

scheduler can automatically adapt its scheduling decisions to the changing workload. This

self-optimization capability poses challenges for system performance modeling because of the

increasing size and complexity of computing systems.

Our goals were twofold: to design performance models that can derive applications’

resource consumption patterns in a systematic way, and to develop performance prediction

models that can adapt to changing workloads. A novelty in the system performance

model design is the use of various machine learning techniques to efficiently deal with

the complexity of dynamic workloads based on monitoring and mining of historical

performance data. In the environments considered in this thesis, virtual machines (VMs)

are used as resource containers to host application executions because of their flexibility in

supporting resource provisioning and load balancing.

Our study introduced three performance models to support self-optimized scheduling

and decision-making. First, a novel approach is introduced for application classification

based on the Principal Component Analysis (PCA) and the k-Nearest Neighbor (k-NN)

classifier. It helps to reduce the dimensionality of the performance feature space and

classify applications based on extracted features. In addition, a feature selection model is

designed based on Bayesian Network (BN) to systematically identify the feature subset,

which can provide optimal classification accuracy and adapt to changing workloads.

Second, an adaptive system performance prediction model is investigated based

on a learning-aided predictor integration technique. Supervised learning techniques are

used to learn the correlations between the statistical properties of the workload and the

best-suited predictors.

In addition to a one-step ahead prediction model, a phase characterization model is

studied to explore the large-scale behavior of an application's resource consumption patterns.

Our study provides novel methodologies to model system and application performance. The performance models can self-optimize over time based on learning from historical runs, and therefore adapt better to changing workloads and achieve higher prediction accuracy than traditional methods with static parameters.

CHAPTER 1
INTRODUCTION

The vision of autonomic computing [1] is to improve manageability of complex

IT systems to a far greater extent than current practice through self-configuration,

self-healing, self-optimization, and self-protection. To perform the self-configuration

and self-optimization of applications and associated execution environments and to realize

dynamic resource allocation, both resource awareness and application awareness are

important. In this context, there has been substantial research on effective scheduling

policies [2–6] with given resource and application specifications. While there are several

methods for obtaining resource specification parameters (e.g., CPU, memory, and disk

information from the /proc file system in Unix systems), application specification is

challenging to describe due to the following factors: 1) lack of knowledge and control of

the application source codes, 2) multi-dimensionality of application resource consumption

patterns, and 3) multi-stage resource consumption patterns of long-running applications.

Furthermore, the dynamics of system performance aggravate the difficulties of performance

description and prediction.

In this dissertation, an integrated framework consisting of algorithms and middleware

for resource performance modeling is developed. It includes system performance prediction

models and application resource demand models based on learning of historical executions.

A novelty of the performance model designs is their use of machine learning techniques

to efficiently and robustly deal with the complex dynamical phenomena of the workload

and resource availability. In addition, virtual machines (VMs) are used as resource

containers because they provide a flexible management platform that is useful for both the

encapsulation of application execution environments and the aggregation and accounting

of resources consumed by an application. In this context, resource scheduling becomes

a problem of how to dynamically allocate resources to virtual machines (which host

application executions) to meet the applications’ resource demands.

The rest of this chapter is organized as follows: Section 1.1 gives an overview of

resource performance modeling. Sections 1.2, 1.3, and 1.4 briefly introduce autonomic

computing, machine learning, and virtual machine concepts.

1.1 Resource Performance Modeling

Performance is a key criterion in the design, procurement, and use of computer

systems. As such, achieving the highest performance for a given cost becomes the goal

of system designers. In the context of computing, a system could be any collection

of resources including hardware and software components. To measure the system

performance, a set of metrics, which refers to the criteria used to evaluate the performance of the system, is selected. The following are the definitions of some commonly used performance metrics [7]:

Response time: The interval between a user's request and the system response.

Throughput: The rate (requests per unit of time) at which requests can be serviced by the system.

Utilization: The fraction of time the resource is busy servicing requests.

Reliability: The probability that the system will satisfactorily perform the task for which it was designed or intended, for a specified time and in a specified environment.

Availability: The fraction of the time the system is available to service users' requests.

In system procurement studies, the cost/performance ratio is commonly used as a

metric for comparing systems. Three techniques for performance evaluation are analytical

modeling, simulation, and measurement. Sometimes it is helpful to use two or more

techniques, either simultaneously or sequentially.

Computer system performance measurements involve monitoring the system while it

is being subjected to a particular workload. In order to perform meaningful measurements,

the workload should be carefully selected based on the services exercised by the workload,

the level of detail, representativeness, and timeliness. Since a real user environment is

generally not repeatable, it is necessary to study the real user environments, observe the

key characteristics, and develop a workload model that can be used repeatedly. This

process is called workload characterization. Once a workload model is available, the effect

of changes in the workload and system can be studied in a controlled manner by simply

changing the parameters of the model. Various workload characterization techniques such

as Principal Component Analysis (PCA) and classifications are used to characterize the

workloads under study in this work and will be discussed in the following chapters. In

addition, various machine learning techniques are used to learn the parameters of the

performance model from historical data.

1.2 Autonomic Computing

With technology advances, the number of computing devices keeps increasing.

Management complexity grows with device volume, increasing the demand for IT professionals and the corresponding labor cost.

administrators from details of system operation and maintenance while providing 24 x 7

service, IBM started an initiative in 2001 which has been termed autonomic computing [1].

The essence of autonomic computing is to enable self-managed systems, which includes the

following aspects:

Self-configuration: Automated system configuration in accordance with high-level

policies.

Self-healing: Automated system fault detection, diagnosis, and recovery (including both hardware and software).

Self-optimization: Continuous system and component performance and efficiency

improvement.

Self-protection: Automated system defense against malicious attacks or cascading

failures.

Autonomic computing presents challenges and opportunities in various areas such

as learning and optimization theory, automated statistical learning, and behavioral

abstraction and models [8]. This dissertation addresses some of the challenges in

the application resource performance modeling to support self-configuration and self-optimization of application execution environments.

Figure 1-1. Structure of an autonomic element. An autonomic manager controls a managed element through a monitor-analyze-plan-execute loop operating over shared knowledge.

Generally, an autonomic system is an interactive collection of autonomic elements:

individual system constituents that contain resources and deliver services to humans and

other autonomic elements. As Figure 1-1 shows, an autonomic element will typically

consist of one or more managed elements coupled with a single autonomic manager that

controls and represents them. The managed element could be a hardware resource, such

as storage, a CPU, or a software resource, such as a database, or a directory service, or

a large legacy system [1]. The monitoring process collects the performance data of the

managed element. Inference can be used to analyze the system performance and plan accordingly. Finally, the autonomic manager executes the plan based on the decisions made. In this work, machine learning is used to gain knowledge of the system's performance under different circumstances over historical runs.
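
To make the cycle concrete, the sketch below outlines a simplified monitor-analyze-plan-execute loop in Python. It is only an illustration of the control cycle described above: the managed element interface and the helper functions (collect_metrics, analyze, plan, execute) are hypothetical placeholders, not part of any middleware described in this dissertation.

    import time
    from collections import deque

    def collect_metrics(element):
        # Monitor: in a real deployment this would query a monitoring system
        # such as Ganglia or /proc; here the managed element reports its own state.
        return element.sample()

    def analyze(metrics):
        # Analyze: a trivial rule that flags the element as "busy" under high CPU load.
        return "busy" if metrics.get("cpu_load", 0.0) > 0.8 else "normal"

    def plan(state):
        # Plan: map the analyzed state to a corrective action.
        return "add_capacity" if state == "busy" else "no_op"

    def execute(element, action):
        # Execute: apply the planned action to the managed element.
        element.apply(action)

    def autonomic_manager(element, cycles=10, interval=5.0):
        knowledge = deque(maxlen=1000)        # retained history of past decisions
        for _ in range(cycles):
            metrics = collect_metrics(element)
            state = analyze(metrics)
            action = plan(state)
            execute(element, action)
            knowledge.append((metrics, state, action))
            time.sleep(interval)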

1.3 Learning

The science of learning plays a key role in the fields of statistics, data mining, and

artificial intelligence, intersecting with areas of engineering and other disciplines. With

advances in computer technology, we currently have the ability to collect, store, and process large amounts of data, as well as to access them from geographically distributed locations over computer networks. Machine learning is the practice of programming computers to optimize a performance criterion using example data or past experience. It uses the theory of statistics in building mathematical models.

Machine learning is a natural solution to automation. It avoids knowledge-intensive

model building and reduces the reliance on expert knowledge. In addition, it can deal

with complex dynamical phenomena and enables the system to adapt to changing environments.

Traditionally, there are three types of learning: supervised learning,

unsupervised learning, and reinforcement learning.

1.3.1 Supervised Learning

In the context of supervised learning, a learning system is a computer program that

makes decisions based on the accumulated experience contained in successfully solved

cases [9]. “Learning” consists of choosing or adapting parameters within the model

structure that work best on the samples at hand and others like them. One of the most

prominent and basic learning tasks is classification or prediction, which is used extensively

in this work. For classification problems, a learning system can be viewed as a higher-level

system that helps build the decision-making system itself, called the classifier. Figure 1-2

illustrates the structure of a classification system and its learning process.

The set of potential observations relevant to a particular problem are called features,

which also go by a host of other names, including attributes and variables. Only correctly

solved cases will be used in building the specific classifier, which is called the training

phase of the classification. The pattern of feature values for each case is associated with

the correct classification or decision to form the sample cases, a set which is also called

the training data. Thus, learning in any of these systems can be viewed as a process of

generalizing these observed empirical associations subject to the constraints imposed by

the chosen classifier model. During the testing phase, the customized classifier is used

to associate a specific pattern of observations with a specific class. The learning method

introduced above is a form of supervised learning, which learns by being presented with

preclassified training data.
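
As a toy illustration of these two phases (not any particular classifier used in this work), a minimal nearest-centroid classifier in Python might look like the following sketch:

    import numpy as np

    class NearestCentroidClassifier:
        # Training phase: learn one centroid per class from labeled sample cases.
        def train(self, cases, labels):
            self.classes = sorted(set(labels))
            self.centroids = {c: np.mean([x for x, y in zip(cases, labels) if y == c], axis=0)
                              for c in self.classes}

        # Testing phase: assign a new case to the class with the closest centroid.
        def classify(self, case):
            return min(self.classes,
                       key=lambda c: np.linalg.norm(np.asarray(case) - self.centroids[c]))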

1.3.2 Unsupervised Learning

Unsupervised learning methods can learn without any human intervention. They are particularly useful in situations where data need to be classified or clustered into a set of classes that are not known in advance. In other words, unsupervised learning fits a model to the observations. It differs from supervised learning in that there is no a priori output.

1.3.3 Reinforcement Learning

Reinforcement learning refers to a class of problems in machine learning which

postulate an agent exploring an environment in which the agent perceives its current

state and takes actions. A system that uses reinforcement learning is given a positive

reinforcement when it performs correctly and a negative reinforcement when it performs

incorrectly. However, information about why and how it performed correctly is not provided to the learning system.

Reinforcement learning algorithms attempt to find a policy for maximizing cumulative

reward for the agent over the course of the problem. The environment is typically

formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this context are highly related to dynamic programming techniques.

Figure 1-2. Classification system representation. During the training phase, sample cases (patterns of feature values paired with correct decisions) are presented to the learning system, which derives an application-specific classifier from a general classifier model. During the testing phase, the customized classifier assigns a class to each case to be classified.

1.3.4 Other Learning Paradigms

In addition to the above three traditional learning methods, there are some other

learning paradigms:

Relational Learning / Structured Prediction: It predicts structure on sets of objects.

For example, it is trained on genome/proteome data with known relationships and can

predict graph structure on new sets of genomes/proteomes.

Semi-Supervised Learning: Given a mix of labeled and unlabeled data, it can obtain a better predictor than training on the labeled data alone.

Transductive Learning: It trains a classifier to give the best predictions on a specific set of test data.

Active Learning: It chooses or constructs the optimal samples to train on next, with the objective of achieving the best predictor with the fewest labeled samples.

Nonlinear Dimensionality Reduction: It learns underlying complex manifolds of data

in high dimensional spaces.

In this work, various learning techniques are used to model the application resource

demand and system performance. These models can help the system adapt to the changing workload and achieve higher performance.

1.4 Virtual Machines

Virtual machines were first developed and used in the 1960s, with the best-known

example being IBM’s VM/370 [10]. A “classic” virtual machine (VM) enables multiple

independent, isolated operating systems (guest VMs) to run on one physical machine (host

server), efficiently multiplexing system resources of the host machine [10].

A virtual-machine monitor (VMM) is a software layer that runs on a host platform

and provides an abstraction of a complete computer system to higher-level software.

The abstraction created by the VMM is called a virtual machine. Figure 1-3 shows the

structure of virtual machines.

1.4.1 Virtual Machine Characteristics

Virtual machines can greatly simplify system management (especially in environments

such as Grid computing) by raising the level of abstraction from that of the operating

system user to that of the virtual machine to the benefit of the resource providers and

users [11]. The following characteristics of virtual machines make them a highly flexible

and manageable application execution platform:

Figure 1-3. Virtual machine structure. Guest applications run on a guest operating system, which runs above a virtual-machine monitor on the host hardware. A virtual-machine monitor is a software layer that runs on a host platform and provides an abstraction of a complete computer system to higher-level software. The host platform may be the bare hardware (Type I VMM) or a host operating system (Type II VMM). The software running above the virtual-machine abstraction is called guest software (operating system and applications).

Configurability: Virtual machines are highly configurable in terms of hardware

(memory, hard disk, and devices) as well as software (operating system, user applications,

and data). It is possible to use on-demand provisioning to adapt the machine configurations to dynamic workloads. For example, recent technical advances in hotplug memory [12] can support dynamic memory extension of a VM guest without shutting down the system.

Security: Virtual machines allow multiple operating systems (OS) to run on a

physical machine in a secure and isolated fashion. High utilization of physical resources

can be achieved by sharing them among multiple virtual machines.

Checkpoint: Virtual machine state can be easily encapsulated into a set of files, which are called VM checkpoints. In case of a fault, application execution can be resumed from the last checkpoint instead of restarted from the beginning. This checkpoint capability can

help to shorten fault recovery times, especially for long-running applications, and maintain

Service Level Agreements (SLAs).

Migration: With the checkpoint capability, migrating a virtual machine is simplified,

and can be achieved by copying a set of files across servers. VM migration can be used

to optimize server load distribution dynamically. Recent technical advances have enabled

instant VM migrations. For example, Xen virtual machines can be migrated in the

order of seconds, and with millisecond downtimes [13]. VMware’s VMotion can support

migration with zero downtime [14]. Techniques based on the Virtual File System (VFS) have been studied in [15] to support VM migration across Wide-Area Networks (WANs).

1.4.2 Virtual Machine Plant

The VMPlant Grid Service [16] handles virtual machine creation and hosting for classic virtual machines (e.g., VMware [17]) and user-mode Linux platforms (e.g., UML [18])

via dynamic cloning, instantiation and configuration. The VMPlant has three major

components: Virtual Machine Production Center (VMPC), Virtual Machine Warehouse

(VMWH) and Virtual Machine Information System (VMIS). The VMPC handles the

virtual machine’s creation, configuration and destruction. It employs a configuration

pattern recognition technique to identify opportunities to apply the pre-cached virtual

machine state to accelerate the machine configuration process. The VMWH stores the

pre-cached machine images, monitors them and their host server’s performance and

performs the maintenance activity. The VMIS stores the static and dynamic information

of the virtual machines and their host server. The architecture of the VMPlant is shown in

Figure 1-4.

The VMPlant provides an API to the VMShop for virtual machine creation, deconstruction, and monitoring. The VMShop has three major components: VMCreator, VMCollector, and VMReporter. The VMCreator handles the virtual machines' creation; the VMCollector handles the machines' deconstruction and suspension; and the VMReporter handles information requests.

Figure 1-4. VMPlant architecture. The VMShop (VMCreator, VMCollector, VMReporter) issues production and destruction orders, VMProd.Order(VMID, HW, cfgDAG) and VMDest.Order(VMID), to VMPlants running on VM host servers; within a VMPlant, the VM Production Planner, VM Production Line, VM Dissembling Line, and VM Warehouse Manager handle VM guests and register/deregister them with the VMIS.

In combination with a virtual machine shop service, VMPlants

deployed across physical resources of a site allow clients (users and/or middleware

acting on their behalf) to instantiate and control client-customized virtual execution

environments. The plant can be integrated with virtual networking techniques (such as

VNET [19]) to allow client-side network management. Customized, application-specific

VMs can be defined in VMPlant with the use of a directed acyclic graph (DAG)

configuration. VM execution environments defined within this framework can then be

cloned and dynamically instantiated to provide a homogeneous application execution

environment across distributed resources.

In the context of the VMPlant, an application can be scheduled to run in a specific

virtual machine, which is called the applicationVM. Therefore, the system performance metrics collected from the applicationVM can reflect and summarize the resource consumption of the application.

CHAPTER 2
APPLICATION CLASSIFICATION BASED ON MONITORING AND LEARNING OF

RESOURCE CONSUMPTION PATTERNS

Application awareness is an important factor of efficient resource scheduling. This

chapter introduces a novel approach for application classification based on the Principal

Component Analysis (PCA) and the k-Nearest Neighbor (k-NN) classifier. This approach

is used to assist scheduling in heterogeneous computing environments. It helps to reduce

the dimensionality of the performance feature space and classify applications based on

extracted features. The classification considers four dimensions: CPU-intensive, I/O

and paging-intensive, network-intensive, and idle. Application class information and the

statistical abstracts of the application behavior are learned over historical runs and used to

assist multi-dimensional resource scheduling.

2.1 Introduction

Heterogeneous distributed systems that serve application needs from diverse users

face the challenge of providing effective resource scheduling to applications. Resource

awareness and application awareness are necessary to exploit the heterogeneities of

resources and applications to perform adaptive resource scheduling. In this context, there

has been substantial research on effective scheduling policies [2–4] with given resource and

application specifications. There are several methods for obtaining resource specification

parameters (e.g., CPU, memory, disk information from /proc in Unix systems). However,

application specification is challenging to describe because of the following factors:

Numerous types of applications: In a closed environment where only a limited number

of applications are running, it is possible to analyze the source codes of each application

or even plug in codes to indicate the application execution stages for effective resource

scheduling. However, in an open environment such as in Grid computing, the growing

number of applications and the lack of knowledge or control of the source codes call for a general method of learning application behaviors without source code modifications.

Multi-dimensionality of application resource consumption: An application’s execution

resource requirement is often multi-dimensional. That is, different applications may stretch

the use of CPU, memory, hard disk or network bandwidth to different degrees. The

knowledge of which kind of resource is the key component in the resource consumption

pattern can assist resource scheduling.

Multi-stage applications: There are cases where long-running scientific applications

exhibit multiple execution stages. Different execution stages may stress different kinds of

resources to different degrees, hence characterizing an application requires knowledge of

its dynamic run-time behavior. The identification of such stages presents opportunities to

exploit better matching of resource availability and application resource requirement across

different execution stages and across different nodes. For instance, with process migration

techniques [20][21] it is possible to migrate an application during its execution for load

balancing.

The above characteristics of grid applications present a challenge to resource

scheduling: How to learn and make use of an application’s multi-dimensional resource

consumption patterns for resource allocation? This chapter introduces a novel approach

to solve this problem: application classification based on the feature selection algorithm,

Principal Component Analysis (PCA), and K-Nearest Neighbor (k-NN) classifier [22][23].

The PCA is applied to reduce the dimensionality of application performance metrics, while

preserving the maximum amount of variance in the metrics. Then, the k-Nearest Neighbor

algorithm is used to categorize the application execution states into different classes

based on the application’s resource consumption pattern. The learned application class

information is used to assist the resource scheduling decision-making in heterogeneous

computing environments.

The VMPlant service introduced in Chapter 1.4.2 provides automated cloning

and configuration of application-centric Virtual Machines (VMs). Problem-solving

environments such as In-VIGO [24] can submit requests to the VMPlant service, which

is capable of cloning an application-specific virtual machine and configuring it with an

appropriate execution environment. In the context of VMPlant, the application can be

scheduled to run on a dedicated virtual machine, which is hosted by a shared physical

machine. Within the VM, system performance metrics such as CPU load, memory usage,

I/O activity and network bandwidth utilization, reflect the application’s resource usage.

The classification system described in this chapter leverages the capability of

summarizing application performance data by collecting system-level data within a

VM, as follows. During the application execution, snapshots of performance metrics are

taken at a desired frequency. A PCA processor analyzes the performance snapshots and

extracts the key components of the application’s resource usage. Based on the extracted

features, a k-NN classifier categorizes each snapshot into one of the following classes:

CPU-intensive, IO-intensive, memory-intensive, network-intensive and idle.

By using this system, resource scheduling can be based on a comprehensive diagnosis

of the application resource utilization, which conveys more information than CPU load

in isolation. Experiments reported in this chapter show that the resource scheduling

facilitated with application class composition knowledge can achieve better average system

throughput than scheduling without the knowledge.

The rest of the chapter is organized as follows: Section 2.2 introduces the PCA and

the k-NN classifier in the context of application classification. Section 2.3 presents the

classification model and implementation. Section 2.4 presents and discusses experimental

results of classification performance measurements. Section 2.5 discusses related work.

Conclusions and future work are discussed in Section 2.6.

2.2 Classification Algorithms

Application behavior can be defined by its resource utilization, such as CPU load,

memory usage, network and disk bandwidth utilization. In principle, the more information

a scheduler knows about an application, the better scheduling decisions it can make.

However, there is a tradeoff between the complexity of the decision-making process and the

optimality of the decision. The key challenge here is how to find a representation of the

application, which can describe multiple dimensions of resource consumption, in a simple

way. This section describes how the pattern classification techniques, the PCA and the

K-NN classifier, are applied to achieve this goal.

A pattern classification system consists of pre-processing, feature extraction,

classification, and post-processing. The pre-processing and feature extraction are known

to significantly affect the classification, because the error caused by wrong features may propagate to the next steps and remain dominant in the overall classification error. In this work, a set of application performance metrics is chosen based on expert

knowledge and the principle of increasing relevance and reducing redundancy [25].

2.2.1 Principal Component Analysis

Principal Component Analysis (PCA) [22] is a linear transformation that represents data in a least-squares sense. It is designed to capture the variance in a dataset in terms of

principal components and reduce the dimensionality of the data. It has been widely used

in data analysis and compression.

When a set of vector samples is represented by a set of lines passing through the mean of the samples, the best linear directions are the eigenvectors of the scatter matrix - the so-called "principal components" shown in Figure 2-1. The corresponding eigenvalues represent the contribution to the variance of the data. When the k principal components with the largest eigenvalues (out of n) are chosen to represent the data, the dimensionality of the data is reduced from n to k.

Principal component analysis is based on the statistical representation of a random

variable. Suppose we have a random vector population x, where

x = (x_1, ..., x_n)^T    (2-1)

and the mean of that population is denoted by

µ_x = E{x}    (2-2)

Figure 2-1. Sample of principal component analysis: a two-dimensional data set (axes Dimension 1 and Dimension 2) with the first and second principal component directions overlaid.

and the covariance matrix of the same data set is

C_x = E{(x - µ_x)(x - µ_x)^T}    (2-3)

The components of C_x, denoted by c_ij, represent the covariances between the random variable components x_i and x_j. The component c_ii is the variance of the component x_i.

From a sample of vectors x_1, ..., x_M, we can calculate the sample mean and the

sample covariance matrix as the estimates of the mean and the covariance matrix.

The eigenvectors e_i and the corresponding eigenvalues λ_i can be obtained by solving the equation

C_x e_i = λ_i e_i,  i = 1, ..., n    (2-4)

For simplicity we assume that the λ_i are distinct. These values can be found, for example, by finding the solutions of the characteristic equation

|C_x - λI| = 0    (2-5)

where I is the identity matrix of the same order as C_x and |·| denotes the determinant of the matrix. If the data vector has n components, the characteristic equation is of order n.

By ordering the eigenvectors in the order of descending eigenvalues (largest first), one

can create an ordered orthogonal basis with the first eigenvector having the direction of

largest variance of the data. In this way, we can find directions in which the data set has

the most significant amounts of energy.

Suppose one has a data set of which the sample mean and the covariance matrix have

been calculated. Let A be a matrix consisting of eigenvectors of the covariance matrix as

the row vectors.

By transforming a data vector x, we get

y = A(x - µ_x)    (2-6)

which is a point in the orthogonal coordinate system defined by the eigenvectors.

Components of y can be seen as the coordinates in the orthogonal basis. We can

reconstruct the original data vector x from y by

x = A^T y + µ_x    (2-7)

using the property of an orthogonal matrix, A^{-1} = A^T, where A^T is the transpose of the matrix A. The original vector x was projected on the coordinate axes defined by the

orthogonal basis. The original vector was then reconstructed by a linear combination of

the orthogonal basis vectors.

Instead of using all the eigenvectors of the covariance matrix, we may represent the

data in terms of only a few basis vectors of the orthogonal basis. If we denote the matrix having the first K eigenvectors as rows by A_K, we can create a similar transformation as seen above

y = A_K(x - µ_x)    (2-8)

and

x = A_K^T y + µ_x    (2-9)

This means that we project the original data vector on the coordinate axes of dimension K and transform the vector back by a linear combination of the basis vectors. This method minimizes the mean-square error between the data and its representation for a given number of eigenvectors.

If the data is concentrated in a linear subspace, this method provides a way to compress the data without losing much information while simplifying its representation. By picking the eigenvectors having the largest eigenvalues, we lose as little information as possible in the mean-square sense.
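
As a minimal numerical sketch of this procedure (assuming NumPy is available; the function and variable names are illustrative, not part of the classification framework itself), the steps of Equations 2-2 through 2-9 can be written as:

    import numpy as np

    def pca_reduce(X, K):
        """Project the rows of X (M samples x n metrics) onto the first K
        principal components and reconstruct them, following Eqs. 2-2 to 2-9."""
        mu = X.mean(axis=0)                       # sample mean (Eq. 2-2)
        C = np.cov(X, rowvar=False)               # sample covariance matrix (Eq. 2-3)
        eigvals, eigvecs = np.linalg.eigh(C)      # C e_i = lambda_i e_i (Eq. 2-4)
        order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
        A_K = eigvecs[:, order[:K]].T             # first K eigenvectors as rows
        Y = (X - mu) @ A_K.T                      # y = A_K (x - mu_x) (Eq. 2-8)
        X_rec = Y @ A_K + mu                      # x ~ A_K^T y + mu_x (Eq. 2-9)
        return Y, X_rec, eigvals[order]

For the application classification described in this chapter, X would hold one row per performance snapshot and one column per monitored metric, and Y would be the reduced feature representation passed to the classifier.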

2.2.2 k-Nearest Neighbor Algorithm

The k-Nearest Neighbor (k-NN) classifier is a supervised learning algorithm in which a new query instance is classified based on the majority category of its k nearest neighbors [26]. It has been used in many applications in the fields of data mining, statistical pattern recognition, image processing, and many others. The purpose of this algorithm is to classify a new object based on attributes and training samples. The classifier does not fit any model; it relies only on memory. Given a query point, we find the k objects (training points) closest to the query point. The k-NN classifier decides the class by considering the votes of the k (an odd number) nearest neighbors. The nearest neighbor is picked as the training data point geometrically closest to the test data in the feature space, as illustrated in Figure 2-2.

Figure 2-2. k-nearest neighbor classification example. A test point X is compared with the class centroids C1, C2, and C3 (classes 1, 2, and 3) at distances d1, d2, and d3; if min(|d1 - d2|, |d1 - d3|, |d2 - d3|) > γ (a predefined threshold), the test data X is qualified as training data with the class whose centroid gives min(d1, d2, d3).

In this work, a vector of the application’s resource consumption snapshots is used

to represent the application. Each snapshot consists of a chosen set of performance

metrics. The PCA is used to preprocess the raw data into independent features for the classifier. Then, a 3-NN classifier is used to classify each snapshot. The majority vote of the snapshots' classes is used to represent the class of the application: CPU-intensive, I/O and paging-intensive, network-intensive, or idle. A machine with no load except for background load from system daemons is considered to be in the idle state.
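
A minimal sketch of this voting scheme (assuming NumPy; the data layout and labels are illustrative only, not the framework's actual implementation) is:

    import numpy as np
    from collections import Counter

    def knn_classify(train_X, train_y, query, k=3):
        # Label one snapshot by the majority vote of its k nearest training snapshots.
        dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distance in feature space
        nearest = np.argsort(dists)[:k]                   # indices of the k closest training points
        votes = Counter(train_y[i] for i in nearest)
        return votes.most_common(1)[0][0]

    def classify_application(train_X, train_y, snapshots, k=3):
        # The application class is the majority vote over its snapshot classes.
        labels = [knn_classify(train_X, train_y, s, k) for s in snapshots]
        return Counter(labels).most_common(1)[0][0]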

2.3 Application Classification Framework

The application classifier is composed of a performance profiler, a classification center,

and an application database (DB) as shown in Figure 2-3. In addition, a monitoring


Figure 2-3. Application classification model. The Performance profiler collects performance metrics of the target application node. The Classification center classifies the application using extracted key components and performs statistical analysis of the classification results. The Application DB stores the application class information. (m is the number of snapshots taken in one application run, t0/t1 are the beginning/ending times of the application execution, VMIP is the IP address of the application's host machine.)

system is used to sample the system performance of a computing node running an

application of interest.

2.3.1 Performance Profiler

The performance profiler is responsible for collecting performance data of the

application node. It interfaces with the resource manager to receive data collection

instructions, including the target node and when to start and stop.


The performance profiler can be installed on any node with access to the performance

metric information of the application node. In our implementation, the Ganglia [27]

distributed monitoring system is used to monitor application nodes. The performance sampler takes snapshots of the performance metrics collected by Ganglia at a predefined interval (currently, 5 seconds) between the application's starting time t0 and ending

time t1. Since Ganglia uses multicast based on a listen / announce protocol to monitor the

machine state, the collected samples consist of the performance data of all the nodes in a

subnet. The performance filter extracts the snapshots of the target application for future

processing. At the end of profiling, an application performance data pool is generated.

The data pool consists of a set of n-dimensional samples A_{n×m} = (a_1, a_2, · · · , a_m), where m = (t_1 − t_0)/d is the number of snapshots taken in one application run and d is the sampling time interval. Each sample a_i consists of n performance metrics, which include all 29 default metrics monitored by Ganglia plus 4 metrics that we added based on the needs of classification: the number of I/O blocks read from/written to

disk, and the number of memory pages swapped in/out. A program was developed to

collect these four metrics (using vmstat) and the metrics were added to the metric list of

Ganglia’s gmond.

2.3.2 Classification Center

The classification center has three components: the data preprocessor, the PCA

processor, and the classifier. To reduce the computational cost and improve the classification accuracy, it employs the PCA algorithm to extract the principal components of the collected resource usage data and then performs classification based on the extracted principal components.

2.3.2.1 Data preprocessing based on expert knowledge

Based on expert knowledge, we identified four pairs of performance metrics, as shown in Table 2-1. Each pair of performance metrics correlates with the resource consumption behavior of a specific application class and has limited redundancy.


A_{n×m}  --Preprocess (n ≥ p)-->  A'_{p×m}  --PCA (p ≥ q)-->  B_{q×m}  --Classify (q ≥ 1)-->  C_{1×m}  --Vote-->  Class

Figure 2-4. Performance feature space dimension reductions in the application classification process.
m: The number of snapshots taken in one application run.
n: The number of performance metrics.
A_{n×m}: All performance metrics collected by the monitoring system.
A'_{p×m}: The selected relevant performance metrics after the zero-mean and unit-variance normalization.
B_{q×m}: The extracted key component metrics.
C_{1×m}: The class vector of the snapshots.
Class: The application class, which is the majority vote of the snapshots' classes.

For example, performance metrics of CPU System and CPU User are correlated to

CPU-intensive applications; Bytes In and Bytes Out are correlated to Network-intensive

applications; IO BI and IO BO are correlated to the IO-intensive applications; Swap In

and Swap Out are correlated to Memory-intensive applications. The data preprocessor

extracts these eight metrics of the target application node from the data pool based on this expert knowledge, reducing the dimension of the performance metric space from n = 33 to p = 8 and generating A'_{p×m} as shown in Figure 2-4. In addition, the preprocessor normalizes the selected metrics to zero mean and unit variance.

2.3.2.2 Feature selection based on principal component analysis

The PCA processor takes the data collected for the performance metrics listed in

Table 2-1 as inputs. It performs a linear transformation of the performance data and selects the principal components based on a predefined minimum fraction of variance. In our implementation, this fraction was set so that exactly two principal components are extracted. Therefore, at the end of processing, the data dimension is further reduced from p = 8 to q = 2 and the matrix B_{q×m} is generated, as shown in Figure 2-4.
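The normalization and variance-based component selection can be sketched as follows. This is an illustration rather than the actual Matlab code: the variance fraction passed in is an assumption, chosen here so that a small number of components (such as the q = 2 used in the prototype) is typically retained.

```python
import numpy as np

def preprocess_and_reduce(A_prime, var_fraction=0.95):
    # A_prime: (p, m) matrix of selected metrics (p metrics, m snapshots).
    X = A_prime.T                                        # snapshots as rows
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # zero mean, unit variance
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    explained = eigvals[order] / eigvals.sum()
    q = int(np.searchsorted(np.cumsum(explained), var_fraction)) + 1
    W = eigvecs[:, order[:q]]                            # (p, q) projection matrix
    return (X @ W).T                                     # B, shape (q, m)
```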


Table 2-1. Performance metric list

Performance Metrics   Description
CPU System / User     Percent CPU system / user
Bytes In / Out        Number of bytes per second into / out of the network
IO BI / BO            Blocks sent to / received from a block device (blocks/s)
Swap In / Out         Amount of memory swapped in / out from / to disk (kB/s)

2.3.2.3 Training and classification

The 3-Nearest Neighbor classifier is used for the application classification in our

implementation. It is trained by a set of carefully chosen applications based on expert

knowledge. Each application represents the key performance characteristics of a class. For

example, an I/O benchmark program, PostMark [28], is used to represent the IO-intensive

class. SPECseis96 [29], a scientific computing intensive program, is used to represent

the CPU-intensive class. A synthetic application, Pagebench, is used to represent the

Paging-intensive class. It initializes and updates an array whose size is bigger than the

memory of the VM, thereby inducing frequent paging activity. Ettcp [30], a benchmark

that measures the network throughput over TCP or UDP between two nodes, is used as

the training application of the Network-intensive class. The performance data of all these

four applications and the idle state are used to train the classifier. For each test data point, the trained classifier calculates its distance to all the training data. The 3-NN classification identifies the three training points with the shortest distance to the test data, and the test data's class is then decided by the majority vote of these three nearest neighbors.

2.3.3 Post Processing and Application Database

At the end of classification, an m-dimensional class vector c_{1×m} = (c_1, c_2, · · · , c_m) is generated. Each element of the vector c_{1×m} represents the class of the corresponding

application performance snapshot. The majority vote of the snapshot classes determines

the application Class. The complete performance data dimension reduction process is

shown in Figure 2-4. In addition to a single value (Class) the application classifier also


outputs class composition, which can be used to support application cost models (Section

4.4). The post processed classification results together with the corresponding execution

time (t1 − t0) are stored in the application database and can be used to assist future

resource scheduling.

2.4 Experimental Results

We have implemented a prototype for application classification including a Perl

implementation of the performance profiler and a Matlab implementation of the

classification center. In addition, Ganglia was used to monitor the working status of

the virtual machines. This section evaluates our approach from the following three aspects:

the classification ability, the scheduling decision improvement and the classification cost.

2.4.1 Classification Ability

The application class set in this experiment has four classes: CPU-intensive, I/O and

paging-intensive, network-intensive, and idle. Applications of the I/O and paging-intensive class can be further divided into two groups based on whether or not they have substantial memory-intensive activity. Various synthetic and benchmark programs,

scientific computing applications and user interactive applications are used to test

the classification ability. These programs represent typical application behaviors of

their classes. Table 2-2 summarizes the set of applications used as the training and the

testing applications in the experiments [28–38]. The 3-NN classifier was trained with the

performance data collected from the executions of the training applications highlighted in

the table. All the application executions were hosted by a VMware GSX virtual machine

(VM1). The host server of the virtual machine was an Intel(R) Xeon(TM) dual-CPU

1.80GHz machine with 512KB cache and 1GB RAM. In addition, a second virtual

machine with the same specification was used to run the server applications of the network

benchmarks.


Table 2-2. List of training and testing applications

Expected Application Behavior   Application      Description
CPU Intensive                   SPECseis96**     A seismic processing application [16].
                                SimpleScalar     A computer architecture simulation tool [18].
                                CH3D             A Curvilinear-grid Hydrodynamics 3D model [19].
I/O & Paging Intensive          PostMark**       A file system benchmark program [15].
                                Pagebench*       A synthetic program which initiates and updates an array whose size is bigger than the memory of the virtual machine.
                                Bonnie           A Unix file system performance benchmark [20].
                                Stream           A synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels [24].
Network Intensive               Ettcp*           A benchmark that measures the network throughput over TCP/UDP between two nodes [17].
                                Autobench        A wrapper around httperf to work together as an automated web server benchmark [25].
                                NetPIPE          A protocol independent network performance measurement tool [21].
                                Postmark_NFS     The Postmark benchmark with a NFS mounted working directory.
                                Sftp             A synthetic program which uses sftp to transfer a 2GB size file.
Interactive                     VMD              A molecular visualization program using 3-D graphics and built-in scripting [22].
                                XSpim            A MIPS assembly language simulator with an X-Windows based GUI [23].
Idle                            Idle*            No application running except background daemons in the machine.


Initially the performance profiler collected data of all the thirty-three (n = 33)

performance metrics once every five seconds (d = 5) during the application execution.

Then the data preprocessor extracted the data of the eight (p = 8) metrics listed in

Table 2-1 based on the expert knowledge of the correlation between these metrics and the

application classes. After that, the PCA processor conducted the linear transformation of

the performance data and selected principal components based on the defined minimum fraction of variance. In this experiment, the variance contribution threshold was set to extract two (q = 2) principal components, which helps to reduce the computational requirements of the classifier. The trained 3-NN classifier then conducted classification based on the data of the two principal components.

The training data's class clustering diagram is shown in Figure 2-5(A). The diagram

shows a PCA-based two-dimensional representation of the data corresponding to the five

classes targeted by our system. After being trained with the training data, the classifier

classifies the remaining benchmark programs shown in Table 2-2. The classifier provides

outputs in two kinds of formats: the application class-clustering diagram, which helps to

visualize the classification results, and the application class composition, which can be

used to calculate the unit application cost.

Figure 2-5 shows the sample clustering diagrams for three test applications. For

example, the interactive VMD application (Figure 2-5(D)) shows a mix of the idle class when the user is not interacting with the application, the I/O-intensive class when the user

is uploading an input file, and the Network-intensive class while the user is interacting

with the GUI through a VNC remote display. Table 2-3 summarizes the class compositions

of all the test applications. Figure 2-6 visualizes the class composition of some sample

benchmark programs. These classification results match the class expectations gained from

empirical experience with these programs. They are used to calculate the unit application

cost shown in section 4.4.


Figure 2-5. Sample clustering diagrams of application classifications. A) Training data: Mixture. B) SimpleScalar: CPU-intensive. C) Autobench: Network-intensive. D) VMD: Interactive. Principal Component 1 and 2 are the principal component metrics extracted by PCA.


Table 2-3. Experimental data: application class compositions

Application Class          Test Applications   # of Samples   Idle     I/O      CPU      Network   Paging
CPU Intensive              SPECseis96_A*       3,434          -        0.26%    99.71%   -         0.03%
                           SPECseis96_C*       112            -        -        100%     -         -
                           CH3D                45             -        -        100%     -         -
                           SimpleScalar        62             -        -        100%     -         -
I/O and Paging Intensive   PostMark            52             -        96.15%   -        -         3.85%
                           Bonnie              94             -        86.17%   4.26%    -         9.57%
                           SPECseis96_B*       5,150          0.21%    42.87%   50.39%   -         6.52%
                           Stream              96             1.04%    79.17%   -        -         19.79%
Network Intensive          PostMark_NFS        77             -        -        -        100%      -
                           NetPIPE             74             4.05%    4.05%    -        91.89%    -
                           Autobench           172            -        -        -        100%      -
                           Sftp                46             -        2.17%    -        97.83%    -
Idle + Others**            VMD                 86             37.21%   40.70%   -        22.09%    -
                           XSpim               9              22.22%   77.78%   -        -         -


In addition, the experimental data also demonstrate the impact of changing execution

environment configurations on the application’s class composition. For example, in

Table 2-3, when SPECseis96 with medium size input data was executed in VM1 with 256MB memory (SPECseis96 A), it was classified as a CPU-intensive application. In the

SPECseis96 B experiment, the smaller physical memory (32MB) resulted in increased

paging and I/O activity. The increased I/O activity is due to the fact that less physical

memory is available to the O/S buffer cache for I/O blocks. The buffer cache size at run

time was observed to be as small as 1MB in SPECseis96 B, and as large as 200MB in

SPECseis96 A. In addition, the execution time increased from 291 minutes 42 seconds in the first case to 426 minutes 58 seconds in the second case.

Similarly, in the experiments with PostMark, different execution environment

configurations changed the application’s resource consumption pattern from one class to

another. Table 2-3 shows that if a local file directory was used to store the files to be read

and written during the program execution, the PostMark benchmark showed the resource

consumption pattern of the I/O-intensive class. In contrast, with an NFS mounted file

directory, it (PostMark NFS) was turned into a Network-intensive application.

2.4.2 Scheduling Performance Improvement

Two sets of experiments are used to illustrate the performance improvement that a

scheduler can achieve with the knowledge of application class. These experiments were

performed on 4 VMware GSX 2.5 virtual machines with 256MB memory each. One of

these virtual machines (VM1) was hosted on an Intel(R) Xeon(TM) dual-CPU 1.80GHz

machine with 512KB cache and 1GB RAM. The other three (VM2, VM3, and VM4) were

hosted on an Intel(R) Xeon(TM) dual-CPU 2.40GHz machine with 512KB cache and 4GB

RAM. The host servers were connected by Gigabit Ethernet.

The first set of experiments demonstrates that the application class information can

help the scheduler to optimize resource sharing among applications running in parallel to

improve system throughput and reduce throughput variances. In the experiments, three



Figure 2-6. Application class composition diagram

applications – SPECseis96 (S) with small data size, PostMark (P) with local file directory

and NetPIPE Client (N) – were selected, and three instances of each application were

executed. The scheduler’s task was to decide how to allocate the nine application instances

to run on the 3 virtual machines (VM1, VM2 and VM3) in parallel, each of which hosted

3 jobs. The VM4 was used to host the NetPIPE server. There are ten possible schedules

available, as shown in Figure 2-7.

When multiple applications run on the same host machine at the same time, there

are resource contentions among them. Two scenarios were compared: in the first scenario,

the scheduler did not use class information, and one of the ten possible schedules was


Figure 2-7. System throughput comparisons for ten different schedules.
1:{(SSS),(PPP),(NNN)}, 2:{(SSS),(PPN),(PNN)}, 3:{(SSP),(SPP),(NNN)}, 4:{(SSP),(SPN),(PNN)}, 5:{(SSP),(SNN),(PPN)}, 6:{(SSN),(SPP),(PNN)}, 7:{(SSN),(SPN),(PPN)}, 8:{(SSN),(SNN),(PPP)}, 9:{(SPP),(SPN),(SNN)}, 10:{(SPN),(SPN),(SPN)}.
S – SPECseis96 (CPU-intensive), P – PostMark (I/O-intensive), N – NetPIPE (Network-intensive).

selected at random. The other scenario used application class knowledge, always allocating

applications of different classes (CPU, I/O and network) to run on the same machine

(Schedule 10, Figure 2-7). The system throughputs obtained from runs of all possible

schedules in the experimental environment are shown in Figure 2-7.

The average system throughput of the schedule chosen with class knowledge was

1391 jobs per day. It achieved the highest throughput among the ten possible schedules –

22.11% larger than the weighted average of the system throughputs of all the ten possible

schedules. In addition, the random selection of the possible schedules resulted in large

variances of system throughput. The application class information can be used to help the scheduler pick the optimal schedule consistently. The application throughput

comparison of different schedules on one machine is shown in Figure 2-8. It compares the


Figure 2-8. Application throughput comparisons of different schedules. MIN, MAX, and AVG are the minimum, maximum, and average application throughputs of all the ten possible schedules. SPN is the proposed schedule 10 {(SPN), (SPN), (SPN)} in Figure 2-7.

Table 2-4. System throughput: concurrent vs. sequential executions

Execution     Elapsed Time (sec)              Time Taken to Finish
              CH3D          PostMark          2 Jobs (sec)
Concurrent    613           310               613
Sequential    488           264               752

throughput of schedule ID 10 (labeled SPN in Figure 2-8) with the minimum, maximum,

and average throughputs of all the ten possible schedules. By allocating jobs from different

classes to the machine, the three applications’ throughputs were higher than average by

different degrees: SPECseis96 Small by 24.90%, Postmark by 48.13%, and NetPIPE by

4.29%. Figure 2-8 also shows that the maximum application throughputs were achieved

by sub-schedule (SSN) for SPECseis96 and (PPN) for NetPIPE instead of the proposed

(SPN). However, the low throughputs of the other applications in those sub-schedules made their total throughputs sub-optimal.


The second set of experiments illustrates the improved throughput achieved by

scheduling applications of different classes to run concurrently over running them

sequentially. In the experiments, a CPU intensive application (CH3D) and an I/O

intensive application (PostMark) were scheduled to run in one machine. The execution

time for concurrent and sequential executions is shown in Table 2-4. The experiment

results show that the execution efficiency losses caused by the relatively moderate resource

contentions between applications of different classes were offset by the gains from the

utilization of idle capacity. The resource sharing of applications of different classes

improved the overall system throughput.

2.4.3 Classification Cost

The classification cost is evaluated based on the per-sample processing time in the data extraction, PCA, and classification stages. Two physical machines were used in this

experiment: The performance filter in Figure 2-3 was running on an Intel(R) Pentium(R)

4 CPU 1.70GHz machine with 512MB memory. In addition, the application classifier was

running on an Intel(R) Pentium(R) III 750MHz machine with 256MB RAM.

In this experiment, a total of 8000 snapshots were taken with five-second intervals

for the virtual machine, which hosted the execution of SPECseis96 (medium). It took the

performance filter 72 seconds to extract the performance data of the target application

VM. In addition, it took another 50 seconds for the classification center to train the

classifier, perform the PCA feature selection, and carry out the application classification. Therefore the unit classification cost is about 15 ms per sample, demonstrating that it is feasible to consider the classifier for online training.
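The per-sample figure follows directly from the measured times: (72 s + 50 s) / 8000 samples ≈ 15.25 ms, i.e., roughly 15 ms per sample.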

2.5 Related Work

Feature selection [39][25] and classification techniques have been applied to many

areas successfully, such as intrusion detection [40][41][42][43], text categorization [44], and

image and speech analysis. Kapadia’s evaluation of learning algorithms for application

performance prediction in [45] shows that the nearest-neighbor algorithm has better


performance than the locally weighted regression algorithms for the tools tested. Our

choice of k-NN classification is based on conclusions from [45]. This thesis differs from

Kapadia’s work in the following ways: First, the application class knowledge is used to

facilitate the resource scheduling to improve the overall system throughput in contrast

with Kapadia’s work, which focuses on application CPU time prediction. Second, the

application classifier takes performance metrics as inputs. In contrast, in [45] the CPU

time prediction is based on the input parameters of the application. Third, the application

classifier employs PCA to reduce the dimensionality of the performance feature space. It is

especially helpful when the number of input features of the classifier is not trivial.

Condor uses process checkpoint and migration techniques [20] to allow an allocation

to be created and preempted at any time. The transfer of checkpoints may occupy

significant network bandwidth. Basney’s study in [46] shows that co-scheduling of CPU

and network resources can improve the Condor resource pool’s goodput, which is defined

as the allocation time when a remotely executing application uses the CPU to make

forward progress. The application classifier presented in this thesis performs learning of

application’s resource consumption of memory and I/O in addition to CPU and network

usage. It provides a way to extract the key performance features and generate an abstract

of the application resource consumption pattern in the form of application class. The

application class information and resource consumption statistics can be used together

with recent multi-lateral resource scheduling techniques, such as Condor’s Gang-matching

[47], to facilitate the resource scheduling and improve system throughput.

Conservative Scheduling [4] uses the prediction of the average and variance of the

CPU load of some future point of time and time interval to facilitate scheduling. The

application classifier shares the common technique of resource consumption pattern

analysis of a time window, which is defined as the time of one application run. However,

the application classifier is capable of taking into account usage patterns of multiple kinds of resources, such as CPU, I/O, network, and memory.


The skeleton-based performance prediction work introduced in [48] uses a synthetic

skeleton program to reproduce the CPU utilization and communication behaviors of

message passing parallel programs to predict application performance. In contrast, the

application classifier provides application behavior learning in more dimensions.

Prophesy [49] employs a performance-modeling component, which uses coupling

parameters to quantify the interactions between kernels that compose an application.

However, to be able to collect data at the level of basic blocks, procedures, and loops,

it requires insertion of instrumentation code into the application source code. In

contrast, the classification approach uses the system performance data collected from

the application host to infer the application resource consumption pattern. It does not

require the modification of the application source code.

Statistical clustering techniques have been applied to learn application behavior

at various levels. Nickolayev et al. applied clustering techniques to efficiently reduce the processor event trace data volume in cluster environments [50]. Ahn and Vetter

conducted application performance analysis by using clustering techniques to identify the

representative performance counter metrics [51]. Both Cohen and Chase’s [52] and our

work perform statistical clustering using system-level metrics. However, their work focuses

on system performance anomaly detection. Our work focuses on application classification

for resource scheduling.

Our work can also be used to learn the resource consumption patterns of a parallel application's child processes or a multi-stage application's sub-stages. However, in this study we focus on sequential and single-stage applications.

2.6 Conclusion

The application classification prototype presented in this chapter shows how to apply

the Principal Component Analysis and k-Nearest Neighbor techniques to reduce the dimensionality of the application resource consumption feature space and assist resource scheduling. In addition to the CPU load, it also takes the I/O, network, and memory


activities into account for the resource scheduling in an effective way. It does not require

modifications of the application source code. Experiments with various benchmark

applications suggest that with the application class knowledge, a scheduler can improve the system throughput by 22.11% on average by allocating applications of different classes to share the system resources.

In this work, the input performance metrics are selected manually based on expert

knowledge. In the next chapter, the techniques for automatically selecting features for

application classification are discussed.


CHAPTER 3
AUTONOMIC FEATURE SELECTION FOR APPLICATION CLASSIFICATION

Application classification techniques based on monitoring and learning of resource

usage (e.g., CPU, memory, disk, and network) have been proposed in Chapter 2 to aid in

resource scheduling decisions. An important problem that arises in application classifiers

is how to decide which subset of numerous performance metrics collected from monitoring

tools should be used for the classification. This chapter presents an approach based on

a probabilistic model (Bayesian Network) to systematically select the representative

performance features, which can provide optimal classification accuracy and adapt to

changing workloads.

3.1 Introduction

Awareness of application resource consumption patterns (such as CPU-intensive, I/O

and paging-intensive and network-intensive) can facilitate the mapping of workloads to

appropriate resources. Techniques of application classification based on monitoring and

learning of resource usage can be used to gain application awareness [53]. Well-known

monitoring tools such as the open source packages Ganglia [54] and dproc [55], and

commercial products such as HP’s Open View [56] provide the capability of monitoring

a rich set of system level performance metrics. An important problem that arises is how

to decide which subset of numerous performance metrics collected from monitoring tools

should be used for the classification in a dynamic environment. In this chapter we address

this problem. Our approach is based on autonomic feature selection and can help to

improve the system’s self-manageability [1] by reducing the reliance on expert knowledge

and increasing the system’s adaptability.

The need for autonomic feature selection and application classification is motivated by

systems such as VMPlant [16], which provides automated provisioning of virtual machines (VMs). In the context of VMPlant, an application can be scheduled to run on a

dedicated virtual machine, whose system level performance metrics reflect the application’s


resource usage. An application classifier categorizes the application into different classes

such as CPU-intensive, disk I/O-intensive, and network-intensive based on the selected

VM performance metrics.

To build an autonomic classification system with self-configurability, it is critical

to devise a systematic feature selection scheme that can automatically choose the most

representative features for application classification and adapt to changing workloads. This

chapter presents an approach of using a probabilistic model, the Bayesian Network, to

automatically select the performance metrics that correlate with application classes and

optimize the classification accuracy. The approach also uses the Mahalanobis distance to

support online selection of training data, which enables the feature selection to adapt to

dynamic workloads. In the rest of this dissertation, we will use the terms “metrics” and

“features” interchangeably.

In Chapter 2, a subset of performance metrics was manually selected based on expert knowledge to correlate with the resource consumption behavior of each application class. However, expert knowledge is not always available. In the case of highly dynamic workloads or large volumes of performance data, manual configuration by a human expert is also not feasible. This presents a need for a systematic way to select the representative

metrics in the absence of sufficient expert knowledge. On the other hand, the use of

the Bayesian Network leaves the option open to integrate expert knowledge with the

automatic feature selection to improve the classification accuracy and efficiency.

Feature selection based on statically selected application performance data, which are used as the training set, may not always provide optimal classification results in dynamic environments. To enable feature selection to adapt to changing workloads, the system must be able to dynamically update the training set with data from recent workloads. A question that arises is how to decide which data should be selected

as training data. In this work, an algorithm based on Mahalanobis distance is used


to identify the training data which can represent the resource consumption pattern of the corresponding application class.

Our experimental results show the following. First, we observe correlations between

pairs of selected performance metrics, which justifies the use of Mahalanobis distance

as a means of taking the correlation into account in the training data selection process.

Second, there is a diminishing return of classification utility function (i.e. the ratio of

classification accuracy over the number of selected metrics) as more features are selected.

The experiments showed that above 90% application classification accuracy can be

achieved with a small subset of performance metrics which are highly correlated with the

application class. Third, the application classification based on the selected features for a

set of benchmark programs and scientific applications matched our empirical experience

with these applications.

The rest of the chapter is organized as follows: The statistical techniques used

are described in Section 3.2. Section 3.3 presents the feature selection model. Section

3.4 presents and discusses the experimental results. Section 3.5 discusses related work.

Conclusions and future work are discussed in Section 3.6.

3.2 Statistical Inference

3.2.1 Feature Selection

Feature selection is a process that selects an optimal subset of the original features based on an evaluation criterion. The evaluation criterion in this work is the classification

accuracy. A typical feature selection process consists of four steps: subset generation,

subset evaluation, stopping criterion, and result validation [57]. Subset generation is a

process of heuristic search of candidate subsets. Each subset is evaluated based on the

evaluation criterion. Then the evaluation result is compared with the previously computed

best result. If it is better, it replaces the best result, and the process continues until the stopping criterion is reached. The selection result is validated by different tests or prior

knowledge.


There are two major types of feature selection algorithms: the filter model and the

wrapper model. The filter model relies on general characteristics of data to evaluate

and select feature subsets without involving any mining algorithm. However, the

wrapper model requires one predetermined mining algorithm and uses its performance

as the evaluation criterion. In this work, a wrapper model is used to search for features

better suited to the classification algorithm (Bayesian Network) aiming to improve the

classification accuracy. Our model employs a forward wrapper model based on Bayesian

Network. This model is introduced in detail in Section 3.3.2.

3.2.2 Bayesian Network

A Bayesian Network (BN) is a directed acyclic graph (DAG) with a conditional

probability distribution for each node. Each node represents a domain variable, and each

arc between nodes represents a probabilistic dependency [58]. It can be used to compute

the conditional probability of a node, given the values of its predecessors; hence, a BN can

be used as a classifier that gives the posterior probability distribution of the class decision

node given the values of other nodes.

Bayesian Networks are based on the work of the mathematician and theologian Rev. Thomas Bayes, who worked with conditional probability theory in the 18th century to discover a basic law of probability, now known as Bayes' rule. Bayes' rule relates a hypothesis, past experience, and evidence:

P(H|E, c) = P(E|H, c) P(H|c) / P(E|c)

where we can update our belief in hypothesis H given the additional evidence E and

the background context (past experience), c.

The left-hand term, P (H|E, c) is called the posterior probability, or the probability of

hypothesis H after considering the effect of the evidence E on past experience c.

The term P (H|c) is called the a-priori probability of H given c alone.

The term P (E|H, c) is called the likelihood and gives the probability of the evidence

assuming the hypothesis H and the background information c is true.


Finally, the last term P (E|c) is independent of H and can be regarded as a

normalizing or scaling factor.

Bayesian Networks capture Bayes’ rule in a graphical model. They are very effective

for modeling situations where some information is already known and incoming data is

uncertain or partially unavailable (unlike rule-based or “expert” systems, where uncertain

or unavailable data results in ineffective or inaccurate reasoning). This robustness in

the face of imperfect knowledge is one of the many reasons why Bayesian Networks are

increasingly used as an alternative to other AI representational formalisms. Bayesian

networks have been applied to many areas successfully, including map learning [59],

medical diagnosis [60][61], and speech and vision processing [62][63]. Compared with

other predictive models, such as decision trees and neural networks, and the standard feature selection model based on Principal Component Analysis (PCA), Bayesian networks also have the advantage of interpretability. Human experts can easily understand the

network structure and modify it to obtain better predictive models. By adding decision

nodes and utility nodes, BN models can also be extended to decision networks for decision

analysis [64].

Consider a domain U of n variables, x1, · · · , xn. Each variable may be discrete having

a finite or countable number of states, or continuous. Given a subset X of variables xi,

where xi ∈ U, if one can observe the state of every variable in X, then this observation is called an instance of X and is denoted as X = kX for the observations xi = ki, xi ∈ X. The "joint space" of U is the set of all instances of U. p(X = kX |Y = kY , ξ) denotes the "generalized probability density" that X = kX given Y = kY for a person with current state information ξ. p(X|Y, ξ) then denotes the "Generalized Probability Density Function" (gpdf) for X, given all possible observations of Y. The joint gpdf over U is the gpdf for U.

A Bayesian network for domain U represents a joint gpdf over U . This representation

consists of a set of local conditional gpdfs combined with a set of conditional independence


Figure 3-1. Sample Bayesian network generated by feature selector

assertions that allow the construction of a global gpdf from the local gpdfs. As shown

previously, the chain rule of probability can be used to ascertain these values:

p(x1, · · · , xk|ξ) = ∏_{i=1}^{k} p(xi|x1, · · · , xi−1, ξ) (3–1)

One assumption imposed by Bayesian Network theory (and indirectly by the Product Rule of probability theory) is that, for each variable xi, Πi ⊆ {x1, · · · , xi−1} must be a set of variables that renders xi and {x1, · · · , xi−1} conditionally independent. In this way:

p(xi|x1, · · · , xi−1, ξ) = p(xi|Πi, ξ) (3–2)

A Bayesian Network Structure then encodes the assertions of conditional independence

in Equation 3–1 above. Essentially then, a Bayesian Network Structure BS is a directed

acyclic graph such that each variable in U corresponds to a node in BS, and the parents of

the node corresponding to xi are the nodes corresponding to the variables in Πi.

Depending on the problem that is defined, either (or both) of the topology and the probability distribution of the Bayesian Network can be pre-defined by hand or may be


learned. In this work, a Bayesian Network with a tree structure and full observability is assumed. Figure 3-1 gives a sample BN learned in the experiment. The root is the

application class decision node, which is used to decide an application class given the value

of the leaf nodes. The root node is the parent of all other nodes. The leaf nodes represent

selected performance metrics, such as network packets sent and bytes written to disk.

They are connected one to another in a series.
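To make the classification step concrete, the sketch below estimates the conditional probability tables and evaluates the posterior over classes for one discretized snapshot. It is a simplified illustration: it treats the metric nodes as conditionally independent given the class node (i.e., it ignores the leaf-to-leaf arcs of Figure 3-1), and the names and smoothing choices are assumptions rather than the dissertation's implementation.

```python
import numpy as np

def train_tables(D, y, n_values, n_classes, alpha=1.0):
    # D: (m, n) integer-discretized metric values in [0, n_values); y: (m,) class labels.
    m, n = D.shape
    prior = np.array([(np.sum(y == c) + alpha) / (m + alpha * n_classes)
                      for c in range(n_classes)])
    cond = np.zeros((n_classes, n, n_values))
    for c in range(n_classes):
        Dc = D[y == c]
        for i in range(n):
            counts = np.bincount(Dc[:, i], minlength=n_values)
            cond[c, i] = (counts + alpha) / (len(Dc) + alpha * n_values)
    return prior, cond

def class_posterior(x, prior, cond):
    # Bayes' rule combined with the independence assertions of Equation 3-2.
    log_p = np.log(prior) + np.array(
        [sum(np.log(cond[c, i, v]) for i, v in enumerate(x))
         for c in range(len(prior))])
    p = np.exp(log_p - log_p.max())
    return p / p.sum()
```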

3.2.3 Mahalanobis Distance

The Mahalanobis distance is a measure of distance between two points in the

multidimensional space defined by multidimensional correlated variables [22][65]. For

example, if x1 and x2 are two points from a distribution characterized by the covariance matrix Σ, then the quantity

((x1 − x2)^T Σ^{−1} (x1 − x2))^{1/2} (3–3)

is called the Mahalanobis distance from x1 to x2, where T denotes the transpose of a

matrix.

In the cases where there are correlations between variables, simple Euclidean distance

is not an appropriate measure, whereas the Mahalanobis distance can adequately account

for the correlations and is scale-invariant. Statistical analysis of the performance data

in Section 3.4.3 shows that there are correlations between the application performance

metrics with various degrees. Therefore, Mahalanobis distance between the unlabeled

performance sample and the class centroid, which represents the average of all existing

training data of the class, is used in the training data qualification process in Section 3.3.1.
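A direct translation of Equation 3-3, given an estimated covariance matrix, is shown below (a sketch only; a pseudo-inverse could be substituted when the covariance matrix is singular).

```python
import numpy as np

def mahalanobis(x1, x2, cov):
    # Equation 3-3: ((x1 - x2)^T Sigma^{-1} (x1 - x2))^{1/2}
    diff = np.asarray(x1) - np.asarray(x2)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```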

3.2.4 Confusion Matrix

The confusion matrix [66] is commonly used to evaluate the performance of classification systems. It shows the predicted and actual classifications made by the system. The matrix size is L×L, where L is the number of different classes. In our case, where there are five target application classes, L is equal to 5.


The classification accuracy is measured as the proportion of the total number of

predictions that are correct. A prediction is considered correct if the data is classified to

the same class as its actual class. Table 3-1 shows a sample confusion matrix with L=2.

There are only two possible classes in this example: Positive and negative. Therefore, its

classification accuracy can be calculated as (a+d)/(a+b+c+d).
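For a general L×L confusion matrix the same computation is the sum of the diagonal divided by the total number of predictions, as in this small sketch:

```python
import numpy as np

def accuracy(confusion):
    # Rows: actual classes, columns: predicted classes.
    M = np.asarray(confusion)
    return M.trace() / M.sum()

# Two-class example of Table 3-1: accuracy([[a, b], [c, d]]) == (a + d) / (a + b + c + d)
```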

3.3 Autonomic Feature Selection Framework

Figure 3-2 shows the autonomic feature selection framework in the context of application classification. In this section, we focus on the classification training center, which enables self-configurability for online application

classification. The training center has two major functions: quality assurance of training

data, which enables the classifier to adapt to changing workloads, and systematic feature

selection, which supports automatic feature selection. The training center consists of three

components: the data quality assuror, the feature selector, and the trainer.

3.3.1 Data Quality Assuror

The data quality assuror (DataQA) is responsible for selecting the training data for

application classification. The inputs of the DataQA are the performance snapshots taken

during the application execution. The outputs are the qualified training data with its

class, such as CPU-intensive.

The training data pool consists of representative data of five application classes

including CPU-intensive, I/O-intensive, memory-intensive, network-intensive, and idle.

Training data of each class c is a set of Kc m-dimensional points, where m is the number

of application-specific performance metrics reported by the monitoring tools. To select the

Table 3-1. Sample confusion matrix with two classes (L=2)

                Predicted
Actual Class    Negative    Positive
Negative        a           b
Positive        c           d


Figure 3-2. Feature selection model. The Performance profiler collects performance metrics of the target application node. The Application classifier classifies the application using extracted key components and performs statistical analysis of the classification results. The DataQA selects the training data for the classification. The Feature selector selects performance metrics which can provide optimal classification accuracy. The Trainer trains the classifier using the selected metrics of training data. The Application DB stores the application class information. (t0/t1 are the beginning/ending times of the application execution, VMIP is the IP address of the application's host machine.)

training data from the application snapshots, only n out of m metrics are extracted based

on previous feature selection result to form a set of Kc n-dimensional training points.

{xkc,1, xkc,2, · · · , xkc,n}, kc = 1, 2, · · · , Kc (3–4)

that comprise a cluster Cc. From [50], it follows that the n-tuple

µc = (x1c , x2c , · · · , xnc) (3–5)


where

xic = (1/Kc) ∑_{kc=1}^{Kc} x_{kc,i} ,   i = 1, 2, · · · , n (3–6)

is called the centroid of the cluster Cc.

The training data selection is a three-step process: First the DataQA extracts the n

out of m metrics of the input performance snapshot to form a training data candidate.

Thus each candidate is represented by an n-dimensional point x = (x1, x2, · · · , xn).

Second, it evaluates whether the input candidate is qualified to be training data representing one of the application classes. Finally, the qualified training data candidate is associated with a scalar value Class, which defines the application class.

The first step is straightforward. In the second and third steps, the Mahalanobis distance between the training data candidate x and the centroid µc of cluster Cc is calculated as follows:

dc(x) = ((x − µc)^T Σc^{−1} (x − µc))^{1/2} (3–7)

where c = 1, 2, · · · , 5 represents the application class and Σ−1c denotes the inverse

covariance matrix of the cluster Cc. The distance from the training data candidate x to

the boundary between two class clusters, for example C1 and C2, is |d1(x) − d2(x)|. If

|d1(x) − d2(x)| = 0, it means that the candidate is exactly at the boundary between

class 1 and 2. The further away the candidate is from the class boundaries, the better it

can represent a class. In other words, there is less probability for it to be misclassified.

Therefore, the DataQA calculates the distance from the candidate to boundaries of all

possible pairs of the classes. If the minimal distance to class boundaries, min(|d1 −d2|, |d1 − d3|, · · · , |d4 − d5|), is bigger than a predefined threshold γ, the corresponding

m-dimensional snapshot of the candidate is determined to be qualified training data of


Table 3-2. Sample performance metrics in the original feature set

Performance Metrics          Description
cpu system / user / idle     Percent CPU system / user / idle
cpu nice                     Percent CPU nice
bytes in / out               Number of bytes per second into / out of the network
io bi / bo                   Blocks sent to / received from a block device (blocks/s)
swap in / out                Amount of memory swapped in / out from / to disk (kB/s)
pkts in / out                Packets in / out per second
proc run                     Total number of running processes
load one / five / fifteen    One / Five / Fifteen minute load average

the class whose centroid has the smallest Mahalanobis distance min(d1, d2, · · · , d5) to the snapshot. Automated and adaptive threshold setting is discussed in detail in [67].
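The qualification test can be summarized in a few lines. The sketch below assumes that a centroid and an inverse covariance matrix are maintained for each of the five class clusters; it returns the class label for a qualified candidate and None otherwise, following Equation 3-7 and the boundary-distance rule above.

```python
import numpy as np
from itertools import combinations

def qualify_candidate(x, centroids, inv_covs, gamma):
    # Mahalanobis distance from the candidate to each class centroid (Eq. 3-7).
    d = {c: float(np.sqrt((x - mu) @ inv_covs[c] @ (x - mu)))
         for c, mu in centroids.items()}
    # Distance to the boundary between clusters i and j is |d_i - d_j|.
    min_boundary = min(abs(d[i] - d[j]) for i, j in combinations(d, 2))
    if min_boundary > gamma:
        return min(d, key=d.get)      # class whose centroid is closest
    return None                       # too close to a boundary: not qualified
```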

In our implementation, Ganglia is used as the monitoring tool and twenty (m = 20)

performance metrics, which are related to resource usage, are included in the training

data. These performance metrics include 16 out of 33 default metrics monitored by

Ganglia and the 4 metrics that we added based on the needs of classification. The four added metrics are the number of I/O blocks read from/written to disk and the number of memory pages swapped in/out. A program was developed to collect these four metrics (using vmstat) and add them to the metric list of Ganglia's monitoring daemon gmond.

Table 3-2 shows some sample performance metrics of the training candidate.

The first round of quality assurance was performed by a human expert at initialization. Subsequent assurance can be conducted automatically by following the above steps to select representative training data for each class.

3.3.2 Feature Selector

The feature selector is responsible for selecting the features which are correlated with

the application’s resource consumption pattern from the numerous performance metrics


Input:  C(F0, F1, · · · , FN−1)    // training data set with N features
Input:  Class                      // class of training data (teacher for learning)
Output: Sbest                      // selected feature subset

begin
    initialize Sbest = {}
    initialize Amax = 0                       // maximum accuracy
    D = discretize(C)                         // convert continuous to discrete features
    repeat
        initialize Anode = 0                  // max accuracy for each node
        initialize Fnode = 0                  // selected feature for each node
        foreach F ∈ ({F0, F1, · · · , FN−1} − Sbest) do
            Accuracy = eval(D, Class, Sbest ∪ {F})   // evaluate Bayesian network with extra feature F
            if Accuracy > Anode then                 // store the current feature
                Anode = Accuracy
                Fnode = F
            end
        end
        if Anode > Amax then
            Sbest = Sbest ∪ {Fnode}
            Amax = Anode
            Anode = Anode + 1
        end
    until (Anode ≤ Amax)
end

Figure 3-3. Bayesian-network based feature selection algorithm for application classification

collected from monitoring tools. By filtering out metrics which contribute less to the

classification, it can help to not only reduce the computational complexity of subsequent

classifications, but also improve classification accuracy.

In our previous work [53], representative features were selected manually based

on expert knowledge. For example, performance metrics of cpu system and cpu user

are correlated to the behavior of CPU-intensive applications; bytes in and bytes out

are correlated to network-intensive applications; io bi and io bo are correlated to the

I/O-intensive applications; swap in and swap out are correlated to memory-intensive

applications. However, to support on-line classification, it is necessary for feature selection

to have the ability to adapt to changing workloads. Therefore, the static selection


conducted by human expert may not be sufficient in a highly dynamic environment.

A feature selection scheme, which can automatically select the representative features

for application classification in a systematic way, can help to solve the problem. This

automated feature selection enables the application classifier to self-configure its input

feature subset to adapt to the changing workload.

A wrapper algorithm based on Bayesian network is employed by the feature selector

to conduct the feature selection. As introduced in Section 3.2.1, although this feature

selection scheme reduces the reliance on human experts' knowledge, the Bayesian network's interpretability leaves the option open to integrate expert knowledge into the selection scheme to build a better classification model.

Figure 3-3 shows the feature selection algorithm. It starts with an empty feature

subset Sbest = {}. To search for the best feature F , it uses the temporary feature set

{Sbest ∪ F} to perform Bayesian Network classification for the discrete training data D.

The classification accuracy is calculated by comparing the classification result against the true Class information contained in the training data. After the accuracy has been evaluated for all remaining features ({F0, F1, · · · , FN−1} − Sbest), the best accuracy is stored in Anode. If Anode is better than the previous best accuracy Amax achieved, the

corresponding feature node is added to the feature subset to form the new subset. This

process is repeated until the classification accuracy cannot be improved any more by

adding any of the remaining features into the subset.
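For illustration, a minimal sketch of this greedy wrapper search is shown below. The dissertation prototype was implemented in Matlab; this Python version is illustrative only, and the eval_classifier argument is an assumed caller-supplied function that trains a Bayesian-network classifier restricted to the chosen feature columns and returns its accuracy.

def select_features(data, labels, eval_classifier):
    """Greedy wrapper feature selection (sketch of the algorithm in Figure 3-3).

    data            : list of samples, each a list of N discretized feature values
    labels          : class label of each sample (the "teacher" for learning)
    eval_classifier : function(data, labels, feature_subset) -> accuracy in [0, 1],
                      assumed to evaluate a Bayesian-network classifier that uses
                      only the features in feature_subset
    """
    n_features = len(data[0])
    best_subset = set()          # Sbest
    best_accuracy = 0.0          # Amax

    while True:
        node_accuracy, node_feature = 0.0, None    # Anode, Fnode for this pass
        for f in range(n_features):
            if f in best_subset:
                continue
            accuracy = eval_classifier(data, labels, best_subset | {f})
            if accuracy > node_accuracy:            # remember the best extra feature
                node_accuracy, node_feature = accuracy, f
        if node_feature is None or node_accuracy <= best_accuracy:
            break                                   # no remaining feature improves accuracy
        best_subset.add(node_feature)               # grow the selected subset
        best_accuracy = node_accuracy
    return best_subset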

3.3.3 Trainer

The classification trainer manages the training of the application classifier. It

monitors the updating status of the training data pool. Every time the DataQA qualifies

new training data, it replaces the oldest data in the training data pool with the new data.

When the percentage of new training data in the pool reaches a predefined threshold (for

example, 80%), the trainer sends a request to the feature selector to start the feature

selection process to generate the updated feature subset. After receiving the updated


feature subset, it calls the classifier to classify the data in the updated training data pool using the old and the new feature subsets, respectively, and then compares the classification accuracy of the two. If the accuracy achieved by the new feature subset is higher than that achieved by the previous subset, the selected feature subset is updated; otherwise, it remains the same.
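A compact sketch of this update policy is given below; it assumes an 80% newness threshold and two caller-supplied helper hooks, run_feature_selection and classify_accuracy, which are illustrative names rather than components of the actual prototype.

class Trainer:
    """Sketch of the classification trainer's retraining policy (names are illustrative)."""

    def __init__(self, pool_size, new_data_threshold=0.8):
        self.pool = []                 # fixed-size training data pool (oldest first)
        self.pool_size = pool_size
        self.new_count = 0             # samples qualified since the last selection
        self.threshold = new_data_threshold
        self.feature_subset = None     # currently selected feature subset

    def add_qualified_sample(self, sample, run_feature_selection, classify_accuracy):
        # replace the oldest sample in the pool with the newly qualified one
        if len(self.pool) >= self.pool_size:
            self.pool.pop(0)
        self.pool.append(sample)
        self.new_count += 1

        # once enough of the pool is new, re-run feature selection
        if self.new_count / self.pool_size >= self.threshold:
            candidate = run_feature_selection(self.pool)
            # keep whichever subset classifies the refreshed pool more accurately
            if (self.feature_subset is None or
                    classify_accuracy(self.pool, candidate) >
                    classify_accuracy(self.pool, self.feature_subset)):
                self.feature_subset = candidate
            self.new_count = 0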

3.4 Experimental Results

We have implemented a Matlab-based prototype of the feature selector. This section shows the experimental results of feature selection for data collected during the execution of a set of applications representative of each class (CPU-, I/O-, memory-, and network-intensive) and the classification accuracy achieved. In addition, statistical analysis of the performance metrics was conducted to justify the use of the Mahalanobis distance in the training data quality assurance process.

In the experiments, all the applications were executed in a VMware GSX 2.5 virtual

machine with 256MB memory. The virtual machine was hosted on an Intel(R) Xeon(TM)

dual-CPU 1.80GHz machine with 512KB cache and 1GB RAM. The CTC and application

classifier were running on an Intel(R) Pentium(R) III 750MHz machine with 256MB RAM.

3.4.1 Feature Selection and Classification Accuracy

Two sets of experiments were conducted offline to evaluate our feature selection

algorithm. In both experiments, the training data, described by 20 performance metrics,

consists of performance snapshots of applications belonging to different classes. In the

experiments, ten-fold cross validation was performed. The training data were randomly divided into two parts: a combination of 50% of the data from each class was used to train the feature selector (training set) to derive the feature subset, and the other 50% was used as the test set to validate the selected features by calculating their classification accuracy.

The first experiment was designed to show the relationship between classification

accuracy and the number of features selected. The second experiment was designed to


Figure 3-4. Average classification accuracy of 10 sets of test data versus number of features selected in the first experiment (x-axis: number of selected features; y-axis: classification accuracy)

Figure 3-5. Two-class test data distribution with the first two selected features (x-axis: bytes_in; y-axis: pkts_in; classes: IO and Memory)


show that prior class information can be used to achieve higher classification accuracy with a smaller number of features.

In the first experiment, the training data consist of performance snapshots of five

classes of applications, including CPU-intensive, I/O-intensive, memory-intensive, and

network-intensive applications, and the snapshots collected from an idle application-VM,

which has only “background noise” from system activity (i.e. without any application

execution during the monitoring period). The feature selector’s task is to select those

metrics which can be used to classify the test set into five classes with optimal accuracy.

In all ten iterations of cross validation, two performance metrics (cpu_system and load_fifteen) were always selected as the best two features. Figure 3-6 shows a sample test data distribution with these two features. If we project the data onto the x-axis or the y-axis, we can see that it is more difficult to differentiate the data of each class using either cpu_system or load_fifteen alone than using both metrics. For example, the cpu_system value ranges of the network-intensive and I/O-intensive applications largely overlap, which makes it hard to separate these two applications with the cpu_system metric alone. Compared with one-metric classification, it is much easier to decide which class the test data belong to by using the information of both metrics. In other words, the combination of multiple features is more descriptive than a single feature.

The classification accuracy versus the number of features selected for the learned Bayesian network described above is plotted in Figure 3-4. It shows that with a small number of features (3 to 4), above 90% classification accuracy can be achieved for this 5-class classification.

In the second experiment, the training data consist of performance snapshots of two classes of applications, I/O-intensive and memory-intensive. Figure 3-5 shows the test data distribution with the first two selected features, bytes_in and pkts_in. A comparison of Figure 3-6 and Figure 3-5 shows that with a reduced number of application classes, higher classification accuracy can be achieved with fewer features. For example,


Table 3-3. Confusion matrix of classification results with expert-selected and automatically selected feature sets. A) Automatic B) Expert

A) Automatic
Actual   Classified as
Class     Idle    CPU     IO    Net    Mem
Idle      4938      0     62      0      0
CPU        231   4746     23      0      0
IO          20     86   2888      6      0
Net          0     12      8   4980      0
Mem          0      0      0      0   5000

B) Expert
Actual   Classified as
Class     Idle    CPU     IO    Net    Mem
Idle      4962      0     38      0      0
CPU          4   4882     10      0    104
IO          20     10   2797      0    173
Net          0      0     24   4970      6
Mem          3      0     36      0   4961

The numbers along the diagonal are the numbers of correctly classified data.

in this experiment, if we know that the application belongs to either the I/O-intensive or the memory-intensive class, 96% classification accuracy can be achieved with two selected features, versus 87% accuracy in the 5-class case. This shows the potential of using pair-wise classification to improve the classification accuracy of multi-class cases. Using the pair-wise approach for multi-class classification is a topic of future research.

3.4.2 Classification Validation

This set of experiments validates the feature selection results with the Principal Component Analysis (PCA) and k-Nearest Neighbor (k-NN) based application classification framework described in [53].

First, the training data distributions based on principal components, which are derived from the automatically selected features of Section 3.4.1 and the manually selected features of previous work [53], are shown in Figure 3-8. The distances between each pair of class centroids in Figure 3-8 are calculated and plotted in Figure 3-7. It shows that


Figure 3-6. Five-class test data distribution with the first two selected features (x-axis: cpu_system (%); y-axis: load_fifteen; classes: Idle, CPU, IO, Network, Memory)

Figure 3-7. Comparison of distances between cluster centers derived from expert-selected and automatically selected feature sets. Cluster pairs: 1: idle-cpu, 2: idle-I/O, 3: idle-net, 4: idle-mem, 5: cpu-I/O, 6: cpu-net, 7: cpu-mem, 8: I/O-net, 9: I/O-mem, 10: net-mem


Figure 3-8. Training data clustering diagrams derived from expert-selected and automatically selected feature sets (axes: principal components 1 and 2). A) Automatic B) Expert


the distances between 9 out of 10 pairs of cluster centroids are larger in the automatic-selection case than in the expert's manual-selection case. This means that, compared with the expert-selected features, the two principal components derived from the automatically selected features form class clusters that are at least as distinct.

Second, the PCA and k-NN based classifications were conducted with both the 8 expert-selected features of previous work [53] and the automatically selected features of Section 3.4.1. Table 3-3 shows the confusion matrices of the classification results. If data are classified into the same classes as their actual classes, the classifications are considered correct. The classification accuracy is the proportion of the total number of classifications that were correct. The confusion matrices show that a classification accuracy of 98.05% can be achieved with the automatically selected feature set, which is similar to the 98.14% accuracy achieved with the expert-selected feature set. Thus the automatic feature selection based on the Bayesian network can reduce the reliance on expert knowledge while offering classification accuracy competitive with manual selection by a human expert.
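As a quick check of these numbers, the overall accuracy is the sum of the diagonal of the confusion matrix divided by the total number of classified samples. The short snippet below, which is illustrative only, reproduces the 98.05% and 98.14% figures from Table 3-3.

def accuracy(confusion):
    """Overall accuracy = sum of diagonal entries / sum of all entries."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Confusion matrices from Table 3-3 (rows: actual Idle, CPU, IO, Net, Mem).
automatic = [[4938, 0, 62, 0, 0],
             [231, 4746, 23, 0, 0],
             [20, 86, 2888, 6, 0],
             [0, 12, 8, 4980, 0],
             [0, 0, 0, 0, 5000]]
expert = [[4962, 0, 38, 0, 0],
          [4, 4882, 10, 0, 104],
          [20, 10, 2797, 0, 173],
          [0, 0, 24, 4970, 6],
          [3, 0, 36, 0, 4961]]

print(round(accuracy(automatic) * 100, 2))   # 98.05
print(round(accuracy(expert) * 100, 2))      # 98.14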

In addition, a set of 8 features selected in the 5-class feature selection experiment

in Section 3.4.1 was used to configure the application classifier and the same training

data used in the feature selection experiment were used to train the application classifier.

Then the trained classifier classified a set of three benchmark programs: SPECseis96 [29], PostMark, and PostMark NFS [28]. SPECseis96 is a scientific application which is computing-intensive but also exercises disk I/O in the initial and final phases of its execution. PostMark is originally a disk I/O benchmark program. In PostMark NFS,

a network file system (NFS) mounted directory was used to store the files which

were read/written by the benchmark. Therefore, PostMark NFS performs substantial

network-I/O rather than disk I/O. The classification results are shown in Figure 3-9. The

results show that 86% of the SPECseis96 test data were classified as cpu-intensive, 95%

of the PostMark data were classified as I/O-intensive, and 61% of the PostMark NFS


Figure 3-9. Classification results of benchmark programs (axes: principal components 1 and 2, the principal component metrics extracted by PCA). A) SPECseis96 B) PostMark C) PostMark NFS


Table 3-4. Performance metric correlation matrices of test applications. A) Correlation matrix of SPECseis96 performance data B) Correlation matrix of PostMark performance data C) Correlation matrix of NetPIPE performance data

A) SPECseis96
Metric     1      2      3      4      5      6
1       1.00  -0.21  -0.34   0.74   0.20  -0.02
2      -0.21   1.00  -0.16  -0.02  -0.17  -0.06
3      -0.34  -0.16   1.00  -0.60   0.20  -0.05
4       0.74  -0.02  -0.60   1.00  -0.19   0.04
5       0.20  -0.17   0.20  -0.19   1.00   0.12
6      -0.02  -0.06  -0.05   0.04   0.12   1.00

B) PostMark
Metric     1      2      3      4      5      6
1       1.00  -0.24   0.22   0.34  -0.08  -0.13
2      -0.24   1.00  -0.22   0.18   0.04  -0.02
3       0.22  -0.22   1.00   0.33   0.30   0.18
4       0.34   0.18   0.33   1.00   0.42   0.47
5      -0.08   0.04   0.30   0.42   1.00   0.20
6      -0.13  -0.02   0.18   0.47   0.20   1.00

C) NetPIPE
Metric     1      2      3      4      5      6
1       1.00   0.29   0.31   0.48   0.27   0.30
2       0.29   1.00   0.49   0.39   0.75   0.95
3       0.31   0.49   1.00   0.50   0.59   0.52
4       0.48   0.39   0.50   1.00   0.42   0.39
5       0.28   0.75   0.59   0.42   1.00   0.75
6       0.30   0.95   0.52   0.39   0.75   1.00

Metrics: 1 - load_five, 2 - pkts_in, 3 - cpu_system, 4 - load_fifteen, 5 - pkts_out, 6 - bytes_out. Correlations larger than 0.5 are highlighted with bold characters in the original table.

data were classified as network-intensive. The results matched our empirical experience with these programs and are close to the results of the expert-selected-feature based classification, which shows 85% cpu-intensive for SPECseis96, 97% I/O-intensive for PostMark, and 62% network-intensive for PostMark NFS.


3.4.3 Training Data Quality Assurance

This set of experiments shows the need for using the Mahalanobis distance in the training data quality assurance process.

The data quality assuror classifies each unlabeled test sample by identifying its nearest neighbor among all class centroids. Its performance thus depends crucially on the distance

metric used to identify the nearest class centroid. In fact, a number of researchers have

demonstrated that nearest neighbor classification can be greatly improved by learning an

appropriate distance metric from labeled examples [65].

Table 3-4 shows the correlation coefficients of each pair of the first six performance metrics collected during application execution, including load_five, pkts_in, cpu_system, load_fifteen, pkts_out, and bytes_out. Three applications are used in these experiments: SPECseis96 [29], PostMark [28], and NetPIPE [34].

The experiments show that there are correlations of varying degrees between pairs of performance metrics. For example, NetPIPE's bytes_out metric is highly correlated with its pkts_in, pkts_out, and cpu_system metrics. In cases where there are correlations between metrics, a distance metric that can take the correlation into account when determining a sample's distance from the class centroid should be used. Therefore, the Mahalanobis distance is used in the training data selection process.
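For reference, the Mahalanobis distance of a sample x from a class with centroid µ and covariance matrix Σ is d(x) = \sqrt{(x − µ)^T Σ^{−1} (x − µ)}. A minimal NumPy sketch of the quality assuror's nearest-centroid test is shown below; it assumes that the per-class centroids and covariance matrices are estimated from the labeled training pool.

import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of sample x from a class centroid."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def nearest_class(x, class_stats):
    """Return the class whose centroid is closest to x in Mahalanobis distance.

    class_stats: dict mapping class label -> (mean vector, covariance matrix),
                 assumed to be estimated from labeled training data.
    """
    return min(class_stats,
               key=lambda c: mahalanobis(x, class_stats[c][0], class_stats[c][1]))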

3.5 Related Work

Feature selection [39][68] and classification techniques have been applied to many

areas successfully, such as intrusion detection [69][40][42], text categorization [44], speech

and image processing [62][63], and medical diagnosis [60][61].

The following works applied these techniques to analyze system performance.

However, they differ from one another in the following aspects: the goals of feature selection, the features under study, and the implementation complexity.

Nickolayev et al. used statistical clustering techniques to identify the representative

processors for parallel application performance tuning [50]. Only event tracing of the


selected processors is needed to capture the interaction between application and system components. This helps to reduce the large event data volume, which can perturb system performance. Their approach does not require modification of application source code.

Ahn et al. applied various statistical techniques to extract the important performance

counter metrics for application performance analysis [51]. Their prototype can support

parallel application’s performance analysis by collecting and aggregating local data. It

requires annotation of application source code as well as appropriate operating system and

library support to collect process information, which is based on hardware counters.

Cohen et al. [52] studied the correlation between component performance metrics and SLO violations in an Internet server platform. There are some similarities between their work and ours in terms of the level of performance metrics under study and the type of classifier used. However, our study differs from theirs in the following ways. First, our study focuses on application classification (CPU-intensive, I/O and paging intensive, and network-intensive) for resource scheduling, whereas their study focused on performance anomaly detection (SLO violation and compliance). Second, our prototype targets online classification; it addresses the training data qualification problem to adapt the feature selection to changing workloads, whereas online training data selection was not the focus of [52]. Third, in our prototype, virtual machines were used to host application executions and summarize each application's resource usage. The prototype supports a wide range of applications, such as scientific programs and online business transaction systems, whereas [52] studied web applications in three-tier client/server systems.

In addition to [52], Aguilera et al. [70] and Magpie [71] also studied performance

analysis of distributed systems. However, they considered message-level traces of system

activities instead of system-level performance metrics. Both of them treated the components of distributed systems as black boxes; therefore, their approaches do not require application or middleware modifications.


3.6 Conclusion

The autonomic feature selection prototype presented in this chapter shows how

to apply statistical analysis techniques to support online application classification. We

envision that this classification approach can be used to provide first-order analysis of

the dominant resource consumption patterns of an application. This chapter shows that

autonomic feature selection enables classification without requiring expert knowledge in

the selection of relevant low-level performance metrics.


CHAPTER 4
ADAPTIVE PREDICTOR INTEGRATION FOR SYSTEM PERFORMANCE PREDICTIONS

The integration of multiple predictors promises higher prediction accuracy than the

accuracy that can be obtained with a single predictor. The challenge is how to select the

best predictor at any given moment. Traditionally, multiple predictors are run in parallel

and the one that generates the best result is selected for prediction. In this chapter, we

propose a novel approach for predictor integration based on the learning of historical predictions. Compared with the traditional approach, it does not require running all the predictors simultaneously. Instead, it uses classification algorithms such as k-Nearest Neighbor (k-NN) and Bayesian classification, together with a dimension reduction technique such as Principal Component Analysis (PCA), to forecast the best predictor for the workload under study. Then only the forecasted best predictor is run for prediction.

4.1 Introduction

Grid computing [72] enables entities to create a Virtual Organization (VO) to share

their computation resources such as CPU time, memory, network bandwidth, and disk

bandwidth. Predicting the dynamic resource availability is critical to adaptive resource

scheduling. However, determining the most appropriate resource prediction model a priori

is difficult due to the multi-dimensionality and variability of system resource usage. First, applications may use different types of resources during their executions. Some resource usages, such as CPU load, may be relatively smooth, whereas others, such as network bandwidth, are burstier. It is hard to find a single prediction model which works best for all types of resources. Second, different applications may have different resource usage patterns. The best prediction model for a specific resource of one machine may not work best for another machine. Third, the resource performance fluctuates dynamically due

to the contention created by competing applications. Indeed, in the absence of a perfect

prediction model, the best predictor for any particular resource may change over time.


This chapter introduces a Learning-Aided Adaptive Resource Predictor (LARPredictor), which can dynamically choose the prediction model best suited to the workload at

any given moment. By integrating the prediction results generated by the best predictor

of each moment during the application run, the LARPredictor can outperform any single

predictor in the pool. It differs from the traditional mix-of-expert resource prediction

approach in that it does not require running multiple prediction models in parallel all

the time to identify the best predictors. Instead, Principal Component Analysis (PCA) and a classification algorithm such as k-Nearest Neighbor (k-NN) are used to

forecast the best prediction model from a pool based on the monitoring and learning of

the historical resource availability and the corresponding prediction performance. The

learning-aided adaptive resource performance prediction can be used to support dynamic

VM provisioning by providing accurate prediction of the resource availability of the host

server and the resource demand of the applications that are reflected by the hosting

virtual machines.

Our experimental results based on the analysis of a set of virtual machine trace data

show:

1. The best prediction model is workload specific. In the absence of a perfect

prediction model, it is hard to find a single predictor which works best across virtual

machines which have different resource usage patterns.

2. The best prediction model is resource specific. It is hard to find a single predictor

which works best across different resource types.

3. The best prediction model for a specific type of resource of a given VM trace varies

as a function of time. The LARPredictor can adapt the predictor selection to the change

of the resource consumption pattern.

4. In the experiments with a set of trace data, the LARPredictor outperformed the

observed single best predictor in the pool for 44.23% of the traces and outperformed the

cumulative-MSE based prediction model used in the Network Weather Service system


(NWS) [73] for 66.67% of the traces. It has the potential to consistently outperform any

single predictor for variable workloads and achieve 18.63% lower MSE than the model used

in the NWS.

The rest of the chapter is organized as follows: Section 4.2 gives an overview of

related work. Section 4.4 describes the linear time series prediction models used to

construct the LARPredictor and Section 4.5 describes the learning techniques used for

predictor selection. Section 4.6 details the work flow of the learning-aided adaptive

resource predictor. Section 4.7 discusses the experimental results. Section 4.8 summarizes

the work and describes future direction.

4.2 Related Work

Time series analysis has been studied in many areas such as financial forecasting [74],

biomedical signal processing [75], and geoscience [76]. In this work, we focus on the time

series modeling for computer resource performance prediction.

In [77] and [78], Dinda et al. conducted extensive study of the statistical properties

and the predictions of host load. Their work indicates that CPU load is strongly

correlated over time, which implies that history-based load prediction schemes are feasible.

They evaluated the predictive power of a set of linear models including autoregression

(AR), moving average (MA), autoregression integrated moving average (ARIMA),

autoregression fractionally integrated moving average (ARFIMA), and window-mean

models. Their results show that the AR model is the best in terms of high prediction

accuracy and low overhead among the models they studied. Based on their conclusion, the

AR model is included in our predictor pool to leverage its performance.

To improve the prediction accuracy, various adaptive techniques have been exploited

by the research community. In [4], Yang et al. developed a tendency-based prediction

model that predicts the next value according to the tendency of the time series change.

An increment or decrement is added to or subtracted from the current measurement, based on the current measurement and other dynamic information, to predict the


next value. Zhang et al. improved the performance of the tendency-based model by using a polynomial fitting method to generate predictions based on the data several steps back [79]. In addition, in [80], Liang et al. proposed a multi-resource prediction

model that uses both the autocorrelation of the CPU load and the cross correlation

between the CPU load and free memory to achieve higher CPU load prediction accuracy.

Vazhkudai et al. [81][82] used linear regression to predict the data transfer time from

network bandwidth or disk throughput.

The Network Weather Service (NWS) [73] performs prediction of both network

throughput and latency for host machines distributed over different geographic distances.

Both the NWS and the LARPredictor use the mix-of-expert approach to select the

best predictor at any given moment. However, they differ from each other in the way

of best predictor selection. The prediction model used in the NWS system runs a

set of predictors in parallel to track their prediction accuracies. A cumulative error

measurement, Mean Square Error (MSE), is calculated for each predictor. The one that

generates the lowest prediction error for the known measurements is chosen to make a

forecast of future measurement values. Section 4.6 shows that the LARPredictor only uses

parallel prediction during the training phase. In the testing phase, it uses the PCA and

k-NN classifier to forecast the best predictor for the next value based on the learning of

historical prediction performances. Only the forecasted best predictor is run to predict the

next value.

The mix-of-expert approach has been applied to the text recognition and categorization area. The combination of multiple classifiers has been proven to increase the recognition rate in difficult problems when compared with a single classifier [83].

Different combination strategies such as weighted voting and probability-based voting and

dimensionality reduction based on concept indexing are introduced in [84].

4.3 Virtual Machine Resource Prediction Overview

This section gives an overview of virtual machine resource prediction.


Figure 4-1. Virtual machine resource usage prediction prototype. The monitor agent, which is installed in the Virtual Machine Monitor (VMM), collects the VM resource performance data and stores them in the round-robin VM performance database. The profiler extracts the performance data of a given time frame for the VM indicated by VMID and DeviceID. The LARPredictor selects the best prediction model based on the learning of historical predictions, predicts the resource performance for time t+1, and stores the prediction results in the prediction database. The prediction results can be used to support the resource manager in performing dynamic VM resource allocation. The Prediction Quality Assuror (QA) audits the LARPredictor's performance and orders re-training of the predictor if the performance drops below a predefined threshold.

Our virtual machine resource prediction prototype, illustrated in Figure 4-1, models how the VM performance data are collected and used to predict future values in support of resource allocation decision-making.

A performance monitoring agent is installed in the Virtual Machine Monitor (VMM)

to collect the performance data of the guest VMs. In our implementation, VMware’s ESX

virtual machines are used to host the application execution and the vmkusage tool [85]

of ESX is used to monitor and collect the performance data of the VM guests and host


from the server host machine’s / proc nodes. The vmkusage tool samples every minute,

and updates its data every five minutes with an average of the one-minute statistics over

the given five-minute interval. The collected data is stored in a Round Robin Database

(RRD). Table 2-1 shows the list of performance features under study in this work.

The profiler retrieves the VM performance data, which are identified by vmID,

deviceID, and a time window, from the round robin performance database. The data of each VM device's performance metric form a time series (x_{t−m+1}, ..., x_t) with an identical sampling interval, where m is the data retrieval window size. The retrieved performance data with

the corresponding time stamps are stored in the prediction database. The [vmID, deviceID,

timeStamp, metricName] forms the combinational primary key of the database. Figure 4-2

shows the XML schema of the database and sample database records of virtual machines

such as VM1, which has one CPU, two Network Interface Cards (NIC), and two virtual

hard disks.

The LARPredictor takes the time series performance data (y_{t−m}, ..., y_{t−1}) as inputs, selects the best prediction model based on the learning of historical prediction results, and predicts the resource performance y_t at the future time. The detailed description of the LARPredictor's workflow is given in Section 4.6. The predicted results are stored in

the prediction DB and can be used to support the resource manager’s dynamic VM

provisioning decision-making.

The Prediction Quality Assuror (QA) is responsible for monitoring the LARPredictor's performance in terms of MSE. It periodically audits the prediction performance by

calculating the average MSE of historical prediction data stored in the prediction DB.

When the average MSE of the data in the audit window exceeds a predefined threshold,

it directs the LARPredictor to re-train the predictors and the classifier using recent

performance data stored in the database.
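A minimal sketch of this audit step, under the assumption of simple illustrative names and a caller-chosen audit window and threshold, might look as follows.

def audit_predictions(history, window, mse_threshold):
    """Sketch of the Prediction QA audit: return True if re-training is needed.

    history       : list of (predicted, observed) pairs from the prediction DB
    window        : number of most recent predictions to audit
    mse_threshold : re-train when the windowed average MSE exceeds this value
    """
    recent = history[-window:]
    if not recent:
        return False
    mse = sum((p - o) ** 2 for p, o in recent) / len(recent)
    return mse > mse_threshold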


<?xml version="1.0" encoding="ISO-8859-1"?>
<xport>
  <meta>
    <start>1154979300</start>   <!-- Time stamp of first sample -->
    <step>300</step>            <!-- Sampling interval -->
    <end>1155066000</end>       <!-- Time stamp of last sample -->
    <rows>290</rows>            <!-- Total number of samples -->
    <columns>12</columns>       <!-- Total number of performance features -->
    <legend>
      <entry>cpu_usedsec</entry>
      <entry>cpu_ready</entry>
      <entry>mem_size</entry>
      <entry>mem_swapped</entry>
      <entry>net1_rKB</entry>
      <entry>net1_wKB</entry>
      <entry>net2_rKB</entry>
      <entry>net2_wKB</entry>
      <entry>hd1_r</entry>
      <entry>hd1_w</entry>
      <entry>hd2_r</entry>
      <entry>hd2_w</entry>
    </legend>
  </meta>
</xport>

Figure 4-2. Sample XML schema of the VM performance DB

4.4 Time Series Models for Resource Performance Prediction

Time series is defined as an ordered sequence of values of a variable at equally spaced

time intervals. A general linear process {Zt} is one that can be represented as a weighted

linear combination of the present and past terms of a white noise process:

Z_t = a_t + Ψ_1 a_{t−1} + Ψ_2 a_{t−2} + ···   (4–1)

where {Z_t} denotes the observed time series, {a_t} denotes an unobserved white noise series, and {Ψ_i} denotes the weights. In this thesis, performance snapshots of virtual

machine’s resources including CPU, memory, disk, and network bandwidth are taken

periodically to form the time series {Zt} under study.


Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend, or seasonal variation). The purpose of time series analysis is generally two-fold:

to understand or model the stochastic mechanism that gives rise to an observed series

and to predict or forecast future values of a series based on the history of that series [86].

Time series analysis techniques have been widely applied to forecasting in many areas such

as economic forecasting, sales forecasting, stock market analysis, communication traffic

control, and workload projection. In this work, simple time series models, such as LAST,

sliding-window average (SW AVG), and autoregressive (AR), are used to construct the

LARPredictor to support online prediction. However, the LARPredictor prototype may be

generally used with other prediction models studied in [78][73][4].

LAST model: The LAST model predicts all future values to be the same as the last measured value:

Z_t = Z_{t−1}   (4–2)

SW AVG model: The sliding-window average model predicts the future values by taking the average over a fixed-length history:

Z_t = \frac{1}{m} \sum_{i=t−m}^{t−1} Z_i   (4–3)

AR model: A pth-order autoregressive process Z_t can be represented as follows:

Z_t = Ψ_1 Z_{t−1} + Ψ_2 Z_{t−2} + ··· + Ψ_p Z_{t−p} + a_t   (4–4)

The current value of the series Z_t is a linear combination of the p latest past values of itself plus a term a_t, which incorporates everything new in the series at time t that is not explained by the past values. The Yule-Walker technique is used for AR model fitting in this work.


Generally, LAST performs better for smooth trace data and AR performs better for

peaky data. In this thesis, an approach to dynamically construct a resource predictor

using multiple predictors such as LAST, AR, and SW AVG is proposed to predict the VM

resource performance.

The prediction performance is measured in mean squared error (MSE) [87], which

is defined as the average squared difference between independent observations and

predictions from the fitted equation for the corresponding values of the independent

variables.

MSE(\hat{θ}) = E[(\hat{θ} − θ)^2]   (4–5)

where \hat{θ} is the estimator of a parameter θ in a statistical model.
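To make these models concrete, the sketch below gives minimal Python/NumPy versions of the LAST, SW AVG, and AR predictors (with Yule-Walker fitting) and of the MSE measure. It is illustrative rather than the dissertation's Matlab implementation, and it assumes the input series has already been normalized to zero mean and unit variance, as in the experiments.

import numpy as np

def predict_last(window):
    """LAST: the next value equals the last measured value."""
    return window[-1]

def predict_sw_avg(window):
    """SW AVG: the next value is the mean of the sliding window."""
    return float(np.mean(window))

def fit_ar(series, p):
    """Estimate AR(p) coefficients Psi_1..Psi_p with the Yule-Walker equations."""
    z = np.asarray(series, dtype=float)
    z = z - z.mean()
    n = len(z)
    # biased autocovariance estimates r(0..p)
    r = np.array([np.dot(z[:n - k], z[k:]) / n for k in range(p + 1)])
    toeplitz = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(toeplitz, r[1:])

def predict_ar(window, coeffs):
    """AR: linear combination of the p latest past values."""
    p = len(coeffs)
    recent = np.asarray(window[-p:], dtype=float)[::-1]   # z_{t-1}, z_{t-2}, ..., z_{t-p}
    return float(np.dot(coeffs, recent))

def mse(predicted, observed):
    """Mean squared error between predictions and observations."""
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    return float(np.mean((predicted - observed) ** 2))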

4.5 Algorithms for Prediction Model Selection

In the absence of a perfect generation model, the best resource prediction model

varies with the machine workload. Learning algorithms are used to learn the relationship

between the workload and the best-suited prediction model. In this work, classification algorithms

are used to forecast the best prediction model for a given workload based on the learning

of historical predictions. In addition, the Principal Component Analysis (PCA) technique

is used to reduce the computational cost of the classification process by reducing the

dimension of the feature space of the input data of the classifiers.

There are two types of classifiers: nonparametric and parametric. The parametric

classifier exploits prior information to model the feature space. When the assumed

model is correct, parametric classifiers outperform nonparametric ones. In contrast, the

nonparametric classifiers do not make such an assumption and are more robust. However, nonparametric classifiers tend to suffer from the curse of dimensionality, which means that the number of samples they require grows exponentially with the dimensionality of the feature space. In this section, we introduce a nonparametric classifier, the k-NN classifier, and a parametric classifier, the Bayesian classifier, which are used for best predictor selection in


this work. While we have chosen to use the k-NN and Bayesian classification algorithms due to their prior success in a large number of classification problems, such as handwritten

digits and satellite image scenes, our methodology may be generally used with other types

of classification algorithms.

4.5.1 k-Nearest Neighbor

The k-Nearest Neighbor (k-NN) classifier is memory-based. Its training data consist of the N pairs (x_1, p_1), ..., (x_N, p_N), where p_i is a class label taking values in 1, 2, ..., P. In this work, P represents the number of prediction models in the pool. The training data are represented by a set of points in the feature space, where each point x_i is associated with its class label p_i. A testing data point x_j is classified to the class of the closest training data. For example, given a test data point x_j, the k training data points x_r, r = 1, ..., k, closest in distance to x_j are identified. The test data point is then classified by the majority vote among the k (an odd number) neighbors.

Since the features under study, such as CPU percentage and network received bytes/sec, have different units of measure, all features are normalized to have zero mean and unit variance [88]. In this work, "closest" is determined by the Euclidean distance (Equation 4–6):

d_{ij} = ‖x_i − x_j‖.   (4–6)

As a nonparametric method, the k-NN classifier can be applied to different time series

without modification. To address the problem associated with high dimensionality, various

dimension reduction techniques can be used in the data preprocessing.
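A minimal NumPy sketch of this classification rule, with k = 3 as in the prototype and the normalized feature vectors assumed to be provided by the caller, is shown below.

import numpy as np
from collections import Counter

def knn_predict(train_x, train_labels, test_point, k=3):
    """Classify one test point by majority vote among its k nearest neighbors.

    train_x      : (N, d) array of normalized training feature vectors
    train_labels : length-N sequence of class labels (here, best-predictor names)
    test_point   : length-d normalized feature vector
    """
    distances = np.linalg.norm(np.asarray(train_x, dtype=float)
                               - np.asarray(test_point, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]                 # indices of the k closest points
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]                   # majority-vote class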

4.5.2 Bayesian Classification

The Bayesian classifier is based on the well-known probability theorem, "Bayes formula". Suppose that we know both the prior probabilities P(ω_j) and the conditional densities p(x|ω_j), where x and ω represent a feature vector and its state (e.g., class), respectively. The joint probability density can be written in two ways: p(ω_j, x) = P(ω_j|x) p(x) = p(x|ω_j) P(ω_j). Rearranging these leads us to "Bayes formula":

P(ω_j|x) = \frac{p(x|ω_j) P(ω_j)}{p(x)},   (4–7)

where in this case of c categories

p(x) = \sum_{j=1}^{c} p(x|ω_j) P(ω_j).   (4–8)

Then, the posterior probabilities P(ω_j|x) can be computed from p(x|ω_j) by Bayes formula. In addition, Bayes formula can be expressed informally in English by saying that

posterior = \frac{likelihood × prior}{evidence}.   (4–9)

The multivariate normal density has been applied successfully to a number of

classification problems. In this work the feature vector can be modeled as a multivariate

normal random variable.

The general multivariate normal density in d dimensions is written as

p(x) = \frac{1}{(2π)^{d/2} |Σ|^{1/2}} \exp\left[ −\frac{1}{2} (x − µ)^T Σ^{−1} (x − µ) \right],   (4–10)

where x is a d-component column vector, µ is the d-component mean vector, Σ is the d-by-d covariance matrix, and |Σ| and Σ^{−1} are its determinant and inverse, respectively. Further, we let (x − µ)^T denote the transpose of x − µ.

The minimization of the probability of error can be achieved by use of the discriminant functions

g_i(x) = ln P(ω_i|x) = ln p(x|ω_i) + ln P(ω_i).   (4–11)


This expression can be evaluated if the densities p(x|ω_i) are multivariate normal. In this case, we have

g_i(x) = −\frac{1}{2} (x − µ_i)^T Σ_i^{−1} (x − µ_i) − \frac{d}{2} ln 2π − \frac{1}{2} ln |Σ_i| + ln P(ω_i).   (4–12)

The resulting classification is performed by evaluating the discriminant functions. When the workloads have similar statistical properties, the Bayesian classifier derived from one workload trace can be applied to another directly. In the case of a highly variable workload, retraining of the classifier is necessary.
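A minimal sketch of this discriminant-based classification, assuming NumPy and labeled training windows, could look as follows; the class-independent term −(d/2) ln 2π is dropped because it does not affect which class attains the largest discriminant value.

import numpy as np

def fit_gaussian_classes(train_x, train_labels):
    """Estimate the per-class mean vector, covariance matrix, and prior probability."""
    train_x = np.asarray(train_x, dtype=float)
    stats = {}
    for c in set(train_labels):
        members = train_x[[i for i, label in enumerate(train_labels) if label == c]]
        stats[c] = (members.mean(axis=0),
                    np.cov(members, rowvar=False),
                    len(members) / len(train_labels))
    return stats

def discriminant(x, mean, cov, prior):
    """g_i(x) from Equation 4-12, up to class-independent constants."""
    diff = x - mean
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

def bayes_classify(x, stats):
    """Assign x to the class with the largest discriminant value."""
    x = np.asarray(x, dtype=float)
    return max(stats, key=lambda c: discriminant(x, *stats[c]))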

4.5.3 Principal Component Analysis

The Principal Component Analysis (PCA) [22][88], also called the Karhunen-Loeve transform, is a linear transformation representing data in a least-square sense. The principal components of a set of data in ℝ^p provide a sequence of best linear approximations to those data, of all ranks q ≤ p.

Denote the observations by x_1, x_2, ..., x_N; the parametric representation of the rank-q linear model is as follows:

f(λ) = µ + V_q λ,   (4–13)

where µ is a location vector in ℝ^p, V_q is a p × q matrix with q orthogonal unit vectors as columns, which are called eigenvectors, and λ is a vector of q parameters, which are called eigenvalues. These eigenvectors are the principal components. The corresponding eigenvalues represent the contribution to the variance of the data. Often there will be just a few (= k) large eigenvalues, and this implies that k is the inherent dimensionality of the subspace governing the data. When the k largest eigenvalues of the q principal components are chosen to represent the data, the dimensionality of the data reduces from q to k.


In this work, the PCA is used to reduce the prediction input data dimensions. It

helps to reduce the computing intensity of the subsequent classification process.
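A small NumPy sketch of this dimension reduction, assuming the framed training windows are supplied as rows of a matrix, is given below; the same mean and projection matrix must be reused for the testing data.

import numpy as np

def pca_reduce(data, n_components):
    """Project data onto its top principal components.

    data         : (N, m) array, one framed window of metric values per row
    n_components : target dimensionality n (n < m)
    Returns (projected data of shape (N, n), mean vector, projection matrix).
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centered = data - mean
    # eigen-decomposition of the covariance matrix of the centered data
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    # keep the eigenvectors associated with the largest eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return centered @ top, mean, top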

4.6 Learning-Aided Adaptive Resource Predictor

This section describes the workflow of the Learning-Aided Adaptive Resource Predictor (LARPredictor) illustrated in Figure 4-3. The prediction consists of two phases: a

training phase and a testing phase. During the training phase, the best predictors for each

set of training data are identified using the traditional mix-of-expert approach. During

the testing phase, the classifier forecasts the best predictor for the test data based on the

knowledge gained from the training data and historical prediction performance. Then only

the selected best predictor is run to predict the resource performance. Both phases include

the data pre-processing and the Principal Component Analysis (PCA) process.

The features under study in this work, as shown in Table 2-1, include CPU, memory,

network bandwidth, and disk I/O usages. Figure 4-4 illustrates how the features are

processed to form the prediction database. Since the features have different units of

measure, a data pre-processor was used to normalize the input data with zero mean and

unit variance. The normalized data are framed according to the prediction window size to

feed the PCA processor.

4.6.1 Training Phase

The training phase of both the k-NN and the Bayesian classifiers mainly consists of two processes: prediction model fitting and best predictor identification. The sets of training data with their corresponding best predictors are used for the k-NN classification in the testing phase. The unknown parameters of the Bayesian classifier are estimated from the training data.

The LAST and SW AVG models do not involve any unknown parameters. They can

be used for predictions directly. The parametric prediction models such as the AR model,

which contain unknown parameters, require model fitting. The model fitting is a process


Figure 4-3. Learning-aided adaptive resource predictor workflow. The input data are normalized and framed with the prediction window size m. The Principal Component Analysis (PCA) is used to reduce the dimension of the input data from the window size m to n (n < m). All prediction models are run in parallel in the training phase to identify the best predictor for each set of training data. The classifier is used to forecast the best predictor for the testing data based on the knowledge gained from the training data. Only the best predictor is used to predict the future value of the testing data.


Figure 4-4. Learning-aided adaptive resource predictor dataflow. First, the u training data X_{1×u} are normalized to X′_{1×u} and subsequently framed to X′_{(u−m+1)×m} according to the predictor order m. The PCA processor is used to reduce the dimension of each set of training data from m to n before prediction. Then the predictors are run in parallel with the inputs X′′_{(u−m+1)×n}, and the one that gives the smallest MSE is identified as the best predictor to be associated with the corresponding training data in the prediction database. The dimension reduction of the testing data is similar to that of the training data and is not shown here.

to estimate the unknown parameters of the models. The Yule-Walker equation [86] is used

in the AR model fitting in this work.

For window-based prediction models, such as SW AVG and AR, the PCA algorithm is applied to reduce the input data dimension. The naive mix-of-expert approach is used to identify the best predictor p_i for each set of pre-processed training data (e.g., (x′_i, x′_{i+1}, ..., x′_{i+m−1})). All prediction models are run in parallel with the training data, and the one which generates the smallest prediction MSE is identified as the best predictor p_i, a class label taking values in (LAST, AR, SW AVG), to be associated with the training data. The u pairs of PCA-processed training data and the corresponding best predictors [(x′′_1, p_1), ..., (x′′_u, p_u)] form the training data of the classifiers.

As a non-parametric classifier, the k-NN classifier does not have an obvious training

phase. The major task of its training phase is to label the training data with class

definitions. As a parametric classifier, the Bayesian classifier uses the training data to

derive its unknown parameters, which are the mean and covariance matrix of training data

of each class, to form the classification model.
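A compact sketch of this training-phase labeling step, with illustrative names and caller-supplied predictor functions, is shown below.

def label_training_windows(windows, next_values, predictors):
    """Label each training window with the predictor that forecasts it best.

    windows     : list of framed, normalized windows (x'_i, ..., x'_{i+m-1})
    next_values : the observed value that follows each window
    predictors  : dict mapping a predictor name (e.g., 'LAST', 'AR', 'SW_AVG')
                  to a function window -> predicted next value
    Returns a list of (window, best_predictor_name) pairs for classifier training.
    """
    labeled = []
    for window, observed in zip(windows, next_values):
        errors = {name: (predict(window) - observed) ** 2
                  for name, predict in predictors.items()}
        best = min(errors, key=errors.get)       # smallest squared error wins
        labeled.append((window, best))
    return labeled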


4.6.2 Testing Phase

Similar to the training phase, the testing data are normalized using the normalization coefficients derived from the training phase and framed with the prediction window size m. Then the PCA is used to reduce the dimension of the preprocessed testing data (y′_{t−m}, y′_{t−m+1}, ..., y′_{t−1}) from m to n.

In the testing phase of the LARPredictor based on the k-NN classifier, the Euclidean distances between the PCA-processed test data (y′′_{t−n}, y′′_{t−n+1}, ..., y′′_{t−1}) and all training data X′′_{(u−m+1)×n} in the reduced n-dimensional feature space are calculated, and the k (k = 3 in our implementation) training data which have the shortest distances to the testing data are identified. The majority vote of the k nearest neighbors' best predictors is chosen as the best predictor, which predicts \hat{y}′_t based on (y′_{t−m}, y′_{t−m+1}, ..., y′_{t−1}) in the case of the AR or SW AVG model, and as \hat{y}′_t = y′_{t−1} in the case of the LAST model. The prediction performance can be obtained by comparing the predicted value \hat{y}′_t with the normalized observed value y′_t.

In the testing phase of the LARPredictor based on the Bayesian classifier, the test data are preprocessed in the same way as for the k-NN classifier. The PCA-processed test data (y′′_{t−n}, y′′_{t−n+1}, ..., y′′_{t−1}) are plugged into the discriminant function (4–12) derived in Section 4.5.2. The parameters of the discriminant function for each class, the mean vector and the covariance matrix, are obtained during the training phase. Then, each test data point is classified to the class with the largest discriminant value.

The testing phase differs from the training phase in that it does not require running

multiple predictors in parallel to identify the one which is best suited to the data and

gives the smallest MSE. Instead, it forecasts the best predictor by learning from historical

predictions. The reasoning here is that these nearest neighbors’ workload characteristics

are closest to the testing data’s and the predictor that works best for these neighbors

should also work best for the testing data.
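Putting the pieces together, a minimal sketch of one testing-phase step, with illustrative argument names and the PCA transform and labeled training windows assumed to come from the training phase, might look as follows.

import numpy as np
from collections import Counter

def lar_predict(test_window, pca_transform, labeled_training, predictors, k=3):
    """Forecast the best predictor for one test window and run only that predictor.

    test_window      : normalized window (y'_{t-m}, ..., y'_{t-1})
    pca_transform    : function mapping a window to its reduced feature vector,
                       fitted on the training data
    labeled_training : list of (reduced training window, best-predictor name) pairs
    predictors       : dict mapping predictor name to a function window -> prediction
    """
    reduced = np.asarray(pca_transform(test_window), dtype=float)
    train_x = np.asarray([w for w, _ in labeled_training], dtype=float)
    labels = [name for _, name in labeled_training]

    # k-NN vote: which predictor was best for the k most similar training windows?
    distances = np.linalg.norm(train_x - reduced, axis=1)
    nearest = np.argsort(distances)[:k]
    best_name = Counter(labels[i] for i in nearest).most_common(1)[0][0]

    # run only the forecasted best predictor on the original (un-reduced) window
    return predictors[best_name](test_window)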


4.7 Empirical Evaluation

We have implemented a prototype of the LARPredictor, including Perl and shell scripts for the profiler that extract and profile the performance data from the round robin performance database, and a Matlab implementation of the LARPredictor itself. This section

evaluates the prediction performance of the LARPredictor using traces of five virtual

machines as follows:

VM1 : Hosts a web server, Globus GRAM/MDS and GridFTP services, and a PBS

head node.

VM2 : Hosts a Linux-based port-forwarding proxy for VNC sessions.

VM3 : Hosts a Windows XP based calendar.

VM4 : Hosts a web server, a list server, and a Wiki server.

VM5 : Hosts a web server.

These virtual machines were hosted by a physical machine with an Intel(R)

Xeon(TM) 2.0GHz CPU, 4GB memory, and 36GB SCSI disk. VMware ESX server

2.5.2 was running on the physical host. The vmkusage tool was run on the ESX server

to collect the resource performance data of the guest virtual machines every minute and

store them in a round robin database. The profiler was used to extract the data with given

VMID, DeviceID, performance metric, starting and ending time stamps, and intervals. In

this experiment, the performance data of a 24-hour period with 5-minute intervals were

extracted for VM2, VM3, VM4, and VM5. The data of a 7-day period with 30-minute

intervals of VM1 were extracted. The data of a given VMID, DeviceID, and performance metric form a time series under study. The time series data were normalized to zero mean and unit variance.

4.7.1 Best Predictor Selection

This set of experiments illustrates the adaptive predictor selection of the LARPredictor. The k-NN classifier was used to forecast the best predictor among LAST, AR, and SW AVG for the workload under study. Only the selected best predictor is


used for performance prediction. VM2 was used in the experiments. Fig. 4-5 shows the predictor selections for the CPU fifteen-minute load average during a 12-hour period with a sampling interval of 5 minutes. The top plot shows the observed best predictor obtained by running

three prediction models in parallel. The middle plot shows the predictor selection of the

LARPredictor and the bottom plot shows the cumulative-MSE based predictor selection

used in the NWS. Similarly the predictor selection results of the trace data of other

resources are shown as follows: Network packets in per second in Fig. 4-6, total amount of

swap memory in Fig. 4-7, and total disk space in Fig. 4-8.

These experimental results show that the best prediction model for a specific

type of resource of a given trace varies as a function of time. In the experiment, the

LARPredictor can better adapt the predictor selection to the changing workload than

the cumulative-MSE based approach presented in the NWS. The LARPredictor’s average

best predictor forecasting accuracy of all the performance traces of the five virtual

machines is 55.98%, which is 20.18% higher than the accuracy of 46.58% achieved by the

cumulative-MSE based predictor used in the NWS for the workload studied.

4.7.2 Virtual Machine Performance Trace Prediction

This set of experiments evaluates the prediction performance of the LARPredictor. Section 4.7.2.1 shows the prediction accuracy of the k-NN based LARPredictor

and all the predictors in the pool. Section 4.7.2.2 compares the prediction accuracy and

execution time of the k-NN based LARPredictor and the Bayesian based LARPredictor.

In addition, Section 4.7.2.3 benchmarks the performance of the LARPredictors and the

cumulative-MSE based prediction model used in the NWS.

In the experiments, ten-fold cross validation was performed for each set of time series

data. A time stamp was randomly chosen to divide the performance data of a virtual

machine into two parts: 50% of the data was used to train the LARPredictor and the

other 50% was used as the test set to measure the prediction performance by calculating its prediction MSE.


Figure 4-5. Best predictor selection for trace VM2 load15 (panels, top to bottom: observed predictor class, adaptive (LARPredictor) predictor class, and cumulative-MSE predictor class versus time index). Predictor class: 1 - LAST, 2 - AR, 3 - SW AVG

4.7.2.1 Performance of k-NN based LARPredictor

The k-NN algorithm was used for classification in this experiment. In the training

phase, the training data were used to derive the regression coefficients of the AR model.

In addition, the three prediction models were run in parallel. The prediction error was

calculated by comparing the predicted value with the observed value. For each prediction,

the model that gave the smallest absolute value of the error was identified as the best

predictor to be associated with the corresponding training data.
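A minimal MATLAB sketch of this labeling step is shown below, assuming the training trace has been framed into a matrix X (one m-sample window per row) with targets y; the names X, y, and label are illustrative.

% Label each training window with the best predictor for its next value:
% 1 = LAST, 2 = AR, 3 = SW_AVG (smallest absolute prediction error).
a = X \ y;                                  % least-squares AR(m) coefficients
label = zeros(size(X, 1), 1);
for i = 1:size(X, 1)
    h = X(i, :);
    preds = [h(end), h * a, mean(h)];       % LAST, AR, and SW_AVG forecasts
    [~, label(i)] = min(abs(preds - y(i))); % index of the best predictor
end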

In the testing phase, the 3NN classifier was used to forecast the best predictors of

the testing data. First, for each set of testing data of the prediction window size, the

PCA was applied to reduce the data dimension from m, which was 5 or 16, to n = 2 in this experiment.

Figure 4-6. Best predictor selection for trace VM2 PktIn. Predictor class: 1 - LAST, 2 - AR, 3 - SW AVG.

Then the Euclidean distances between the test data and all the training

data in the reduced feature space were calculated. The three training data which had

the shortest distances to the testing data were identified and the majority vote of their

associated best predictors was forecasted to be the best predictor of the testing data.

Finally, the forecasted best predictor was run to predict the future value of the testing

data. The MSE of each time series was calculated to measure the performance of the

LARPredictor. Tables 4-1, 4-2, 4-3, 4-4, and 4-5 show the prediction performance of

the LARPredictor with the current implementation (LAR) and the three prediction models

including LAST, AR, and SW AVG for all resource performance traces of the five virtual

machines. Also shown in these tables is the computed MSE for a perfect LARPredictor (P-LAR). The MSE of the P-LAR model shows the upper bound of the prediction accuracy that can be achieved by the LARPredictor. The MSE of the best predictor among LAR, LAST, AR, and SW AVG is highlighted with italic bold numbers.

Figure 4-7. Best predictor selection for trace VM2 Swap. Predictor class: 1 - LAST, 2 - AR, 3 - SW AVG.

Table 4-6 shows the best predictor among LAST, AR, and SW AVG for all the

resource performance metrics and VM traces. The symbol “*” indicates the cases in which

the LARPredictor achieved equal or higher prediction accuracy than the best of the three

predictors. Overall, the AR model performed better than the LAST and the SW AVG

models.

Figure 4-8. Best predictor selection for trace VM2 Disk. Predictor class: 1 - LAST, 2 - AR, 3 - SW AVG.

The above experimental results show:

1. It is hard to find a single prediction model among LAST, AR, and SW AVG that performs best for all types of resource performance data for a given VM trace. For example, for the VM1 trace data shown in Table 4-1, each of the three models (LAST, AR, and SW AVG) outperformed the other two for a subset of the performance metrics. In this experiment, only for the trace data of VM3 did a single model (AR) perform best across all metrics.

2. It is hard to find a single prediction model among the three that performs best consistently for a given type of resource across all the VM traces. In the experiment, only for the CPU performance metrics did a single model (AR) perform best across all traces.

3. The LARPredictor achieved better-than-expert performance using the mix-of-experts approach for 44.23% of the workload traces. This shows the potential for the LARPredictor to outperform any single predictor in the pool and to approach the prediction accuracy of the P-LAR by improving the best-predictor forecasting / classification accuracy. How to further improve the predictor classification accuracy is a topic of our future research.

Table 4-1. Normalized prediction MSE statistics for resources of VM1
(duration = 168 hours, interval = 30 minutes, prediction order = 16)

Perf. Metrics        P-LAR    LAR      LAST     AR       SW
CPU_usedsec          0.6976   0.9508   1.1436   0.9456   1.0352
CPU_ready            0.6775   0.9632   1.1699   0.9579   1.0333
Memory_size          0.2071   0.2389   0.2298   0.2379   0.4883
Memory_swapped       0.2071   0.2386   0.2298   0.2379   0.4883
NIC1_received        0.3981   0.5436   1.836    0.5436   0.9831
NIC1_transmitted     0.3776   0.5845   1.8236   0.5845   0.9829
NIC2_received        0.9788   0.9912   1.4392   0.9966   1.0397
NIC2_transmitted     0.3983   0.5463   1.8406   0.5463   0.9843
VD1_read             0.9062   1.0215   1.2849   0.9754   1.0511
VD1_write            0.7969   0.9587   1.1905   0.9473   1.0566
VD2_read             1        1.2156   1.4191   1.1536   1.035
VD2_write            0.662    0.9931   1.1572   0.9929   1.0292

4.7.2.2 Performance comparison of k-NN and Bayesian-classifier based LARPredictor

In this experiment, a set of VM traces with 138,240 performance data points was used to feed the LARPredictor. Half of the data were used for training and the other half were used for testing. A Bayesian-classifier based LARPredictor was implemented. Fig. 4-9 shows the prediction performance comparison between it and the k-NN based LARPredictor for all the resources of VM1. The profile report of the Matlab program execution showed that the k-NN based LARPredictor took 205.8 seconds of CPU time, with 193.5 seconds in the testing phase and 12.3 seconds in the training phase. The Bayesian-classifier based LARPredictor took 132.1 seconds of CPU time to finish execution, with a 120.8-second testing phase and an 11.3-second training phase.

Table 4-2. Normalized prediction MSE statistics for resources of VM2
(duration = 24 hours, interval = 5 minutes, prediction order = 5)

Perf. Metrics        P-LAR    LAR      LAST     AR       SW
CPU_usedsec          0.8142   1.1158   1.2476   1.0311   1.0912
CPU_ready            0.7873   1.0128   1.2167   1.0166   1.0948
Memory_size          0.5328   0.6213   0.637    0.6262   0.79
Memory_swapped       0.5328   0.6214   0.637    0.6262   0.7901
NIC1_received        0.4872   0.6189   0.6663   0.611    0.6831
NIC1_transmitted     0.7581   1.0138   1.0303   1.0209   1.0737
NIC2_received        0.6626   0.89     0.8765   0.8923   1.0242
NIC2_transmitted     0.7434   0.9924   1.0266   0.9949   1.0775
VD1_read             0.9582   1.0467   1.2249   1.0264   1.0912
VD1_write            0.7733   1.0744   1.1574   1.0129   1.0748
VD2_read             1.0208   1.4153   1.4155   1.0843   1.0972
VD2_write            0.7389   0.9941   1.0816   0.9372   1.0792

The experimental results show that the prediction accuracy in terms of normalized

MSE of the Bayesian-classifier based LARPredictor is about 3.8% worse than the k-NN

based one. However, it shortened the CPU time of the testing phase by 37.57%.

4.7.2.3 Performance comparison of the LARPredictors and the cumulative-MSE based predictor used in the NWS

This section compares the prediction accuracy of the LARPredictors and the NWS predictor. Figs. 4-9, 4-10, 4-11, 4-12, and 4-13 show the prediction accuracy of the perfect LARPredictor that has 100% best-predictor forecasting accuracy (P-LARP), the k-NN and Bayesian based LARPredictors (KnnLARP and BaysLARP), the cumulative-MSE-of-all-history based predictor used in the NWS (Cum.MSE), and the cumulative-MSE based predictor with a fixed window size (n = 2 in this experiment) used in the NWS (W-Cum.MSE).

Table 4-3. Normalized prediction MSE statistics for resources of VM3
(duration = 24 hours, interval = 5 minutes, prediction order = 5)

Perf. Metrics        P-LAR    LAR      LAST     AR       SW
CPU_usedsec          0.9883   1.0395   1.4341   1.0376   1.0989
CPU_ready            0.6826   0.9502   1.6594   0.9502   1.0921
Memory_size          0.5009   0.6169   0.6818   0.6216   0.7481
Memory_swapped       0        0        0        NaN      0
NIC1_received        0.7231   1.0009   1.4492   1.0033   1.0501
NIC1_transmitted     0.9931   1.0514   1.3068   1.0665   1.0943
VD1_read             0        0        0        NaN      0
VD1_write            0        0        0        NaN      0
VD2_read             0.9728   1.0276   1.3969   1.0281   1.1016
VD2_write            0.8696   0.9938   1.245    0.9946   1.0815

The experimental results show that, without running all the predictors in parallel all the time, the LARPredictor outperformed the cumulative-MSE based predictor used in the NWS for 66.67% of the traces. The perfect LARPredictor shows the potential to achieve an 18.6% lower MSE on average than the cumulative-MSE based predictor.
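For reference, the cumulative-MSE selection used as the baseline can be sketched in MATLAB as follows; this is an illustrative reading of the two variants compared above (all-history Cum.MSE and fixed-window W-Cum.MSE), not the NWS code itself, and the error matrix errs is an assumed data layout.

% errs(t, j) holds the squared error of predictor j (1=LAST, 2=AR, 3=SW_AVG) at time t.
T = size(errs, 1);
pick  = ones(T, 1);                          % all-history cumulative-MSE choice
pickW = ones(T, 1);                          % fixed-window (w = 2) choice
w = 2;
for t = 2:T
    [~, pick(t)]  = min(sum(errs(1:t-1, :), 1));            % best predictor so far
    [~, pickW(t)] = min(sum(errs(max(1, t-w):t-1, :), 1));   % best over the last w errors
end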

4.7.3 Discussion

PCA is an optimal way to project data in the mean-square sense. The computational complexity of estimating the PCA is O(d^2 W) + O(d^3) for an original set of W d-dimensional data points [89]. In the context of resource performance time series prediction, W = 1 and d is the prediction window size. The typically small input data size in this context makes the use of PCA feasible. There also exist computationally less expensive methods [90] for finding only a few eigenvectors and eigenvalues of a large matrix; in our experiments, we use appropriate Matlab routines to realize these.

Table 4-4. Normalized prediction MSE statistics for resources of VM4
(duration = 24 hours, interval = 5 minutes, prediction order = 5)

Perf. Metrics        P-LAR    LAR      LAST     AR       SW
CPU_usedsec          0.2819   0.3781   1.7      0.3787   1.1859
CPU_ready            0.4339   0.59     1.6385   0.5904   1.1689
Memory_size          0.3453   0.4638   0.4615   0.4624   0.6628
Memory_swapped       0.2042   0.2595   0.2571   0.2596   0.3592
NIC1_received        0.7175   0.9853   1.0552   0.9231   0.9313
NIC1_transmitted     0.8713   1.0169   1.2649   1.0075   1.0501
NIC2_received        0.7026   1.0695   1.1324   1.0253   1.0699
NIC2_transmitted     0.8423   1.0276   1.3369   1.0363   1.0753
VD1_read             0.7452   0.9679   1.2066   0.9658   0.9832
VD1_write            0.6985   0.9766   1.136    0.9836   0.98
VD2_read             1.01     1.1296   1.4181   1.0608   1.0973
VD2_write            0.8121   1.0134   1.2204   1.0152   1.0474

Table 4-5. Normalized prediction MSE statistics for resources of VM5
(duration = 24 hours, interval = 5 minutes, prediction order = 5)

Perf. Metrics        P-LAR    LAR      LAST     AR       SW
CPU_usedsec          0.9165   1.0731   1.503    1.0875   1.1023
CPU_ready            0.5578   0.964    1.7314   0.8673   1.108
Memory_size          0.5569   0.6094   0.66     0.6115   0.8873
Memory_swapped       0.5499   0.6043   0.6498   0.6073   0.8719
NIC1_received        0        0        0        NaN      0
NIC1_transmitted     0        0        0        NaN      0
NIC2_received        0.7894   0.9264   1.1422   0.9232   0.887
NIC2_transmitted     0.967    1.0162   1.3807   1.014    1.1051
VD1_read             1.0165   1.2856   1.3429   1.078    1.0744
VD1_write            0.811    0.946    1.0775   0.9376   1.0391
VD2_read             0        0        0        NaN      0
VD2_write            1.0115   1.0691   1.4048   1.0653   1.0969

Table 4-6. Best predictors of all the trace data. The predictors shown in the table have the smallest MSE among all the three predictors (LAST, AR, and SW AVG). The "*" symbol indicates that the LARPredictor outperforms the best predictor in the predictor pool.

Perf. Metrics    VM1       VM2     VM3     VM4        VM5
CPU usedsec      AR        AR      AR      AR*        AR*
CPU ready        AR        AR*     AR*     AR*        AR
Mem size         LAST      AR*     AR*     LAST       AR*
Mem swap         LAST      AR*     -       LAST       AR*
NIC1 Rx          AR*       AR      AR*     AR         -
NIC1 Tx          AR*       AR*     AR*     AR         -
NIC2 Rx          AR*       LAST    -       AR         SW AVG
NIC2 Tx          AR*       AR*     -       AR*        AR
VD1 read         AR        AR      -       AR         SW AVG
VD1 write        AR        AR      -       SW AVG*    AR
VD2 read         SW AVG    AR      AR*     AR         -
VD2 write        AR        AR      AR*     AR*        AR

The k-NN does not have an off-line learning phase. The "training phase" in k-NN simply indexes the N training data for later use. Therefore, the training complexity of k-NN is O(N) in both time and space. In the testing phase, the k nearest neighbors of a testing data point can be obtained in O(N) time by using a modified version of quicksort [91]. There are also fast algorithms for finding nearest neighbors [92][93].
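As an illustration of the testing phase of Section 4.7.2.1, the following MATLAB sketch projects a framed test window onto the two leading principal components of the training windows and takes a majority vote among its three nearest training neighbors; the matrix names, the label vector produced in the training phase, and the use of eig on the sample covariance are assumptions of this sketch rather than the exact code.

% X: n x m framed training windows, label: n x 1 best-predictor labels,
% x: 1 x m framed test window.
mu = mean(X, 1);
[V, D] = eig(cov(X));                          % eigen-decomposition of the sample covariance
[~, order] = sort(diag(D), 'descend');
W = V(:, order(1:2));                          % two leading principal directions
Z = (X - repmat(mu, size(X, 1), 1)) * W;       % training windows in the reduced space
z = (x - mu) * W;                              % test window in the reduced space
d = sum((Z - repmat(z, size(Z, 1), 1)).^2, 2); % squared Euclidean distances
[~, nn] = sort(d);
bestPredictor = mode(label(nn(1:3)));          % 3-NN majority vote: 1=LAST, 2=AR, 3=SW_AVG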

Three simple time series models were used in this experiment to show the potential of using dynamic predictor selection based on learning to improve prediction accuracy. However, the LARPredictor prototype may generally be used with other, more sophisticated prediction models such as those studied in [78][73][4]. Generally, the more predictors there are in the pool and the more complex the predictors are, the more beneficial it is to use the LARPredictor, because the classification overhead can be better amortized by running only a single predictor at any given time.

Figure 4-9. Predictor performance comparison (VM1). 1 - CPU usedsec, 2 - CPU ready, 3 - Mem size, 4 - Mem swap, 5 - NIC1 rx, 6 - NIC1 tx, 7 - NIC2 rx, 8 - NIC2 tx, 9 - VD1 read, 10 - VD1 write, 11 - VD2 read, 12 - VD2 write

4.8 Conclusion

The best prediction model varies with the types of resources and workload from time to time. We have developed a time series resource prediction model, LARPredictor, which can adapt the predictor selection to the changing workload. The k-NN classifier and

the Bayesian classifier are used to forecast the best predictor for the workload based on

the learning of historical load characteristics and prediction performance. The principal

component analysis technique has been applied to reduce the input data dimension of

the classification process. Our experimental results with the traces of the full range

of virtual machine resources including CPU, memory, network and disk show that the

LARPredictor can effectively identify the best predictor for the workload and achieve

prediction accuracies that are close to or even better than any single best predictor.

Figure 4-10. Predictor performance comparison (VM2). 1 - CPU usedsec, 2 - CPU ready, 3 - Mem size, 4 - Mem swap, 5 - NIC1 rx, 6 - NIC1 tx, 7 - NIC2 rx, 8 - NIC2 tx, 9 - VD1 read, 10 - VD1 write, 11 - VD2 read, 12 - VD2 write

Figure 4-11. Predictor performance comparison (VM3). 1 - CPU usedsec, 2 - CPU ready, 3 - Mem size, 4 - Mem swap, 5 - NIC1 rx, 6 - NIC1 tx, 7 - NIC2 rx, 8 - NIC2 tx, 9 - VD1 read, 10 - VD1 write, 11 - VD2 read, 12 - VD2 write

Figure 4-12. Predictor performance comparison (VM4). 1 - CPU usedsec, 2 - CPU ready, 3 - Mem size, 4 - Mem swap, 5 - NIC1 rx, 6 - NIC1 tx, 7 - NIC2 rx, 8 - NIC2 tx, 9 - VD1 read, 10 - VD1 write, 11 - VD2 read, 12 - VD2 write

Figure 4-13. Predictor performance comparison (VM5). 1 - CPU usedsec, 2 - CPU ready, 3 - Mem size, 4 - Mem swap, 5 - NIC1 rx, 6 - NIC1 tx, 7 - NIC2 rx, 8 - NIC2 tx, 9 - VD1 read, 10 - VD1 write, 11 - VD2 read, 12 - VD2 write

CHAPTER 5
APPLICATION RESOURCE DEMAND PHASE ANALYSIS AND PREDICTIONS

Profiling the execution phases of applications can help to optimize the utilization

of the underlying resources. This chapter presents a novel system level application-

resource-demand phase analysis and prediction approach in support of on-demand

resource provisioning. This approach explores large-scale behavior of applications’ resource

consumption, followed by analysis using a set of algorithms based on clustering. The

phase profile, which is learned from historical runs, is used to classify and predict future

phase behavior. This process takes into consideration applications’ resource consumption

patterns, phase transition costs, and penalties associated with Service-Level Agreement (SLA) violations.

5.1 Introduction

Recently there has been a renewed interest in using virtual machine(s) (VM) as a

container [94] of the application’s execution environment both in academia and industry

[11][16][95]. This is motivated by the idea of providing computing resources as a utility

and charging the users for a specific usage. For example, in August 2006, Amazon

launched its Beta version of VM-based Elastic Compute Cloud (EC2) web service. EC2

allows users to rent virtual machines with specific configurations from Amazon and can

support changes in resource configurations on the order of minutes. In systems that

allow users to reserve and reconfigure resource allocations and charge based upon such

allocations, users have an incentive to request no more than the amount of resources an

application needs. A question that arises here is: how can resource provisioning be adapted to the changing workload?

In this chapter, we focus on modeling and analyzing long-running applications’ phase

behavior. The modeling is based on monitoring and learning of the applications’ historical

resource consumption patterns, which likely vary over time. Understanding such behavior is critical to optimizing resource scheduling. To self-optimize the configuration of an application's execution environment, we first develop the analytical tools necessary to automatically and efficiently discover similarities and changes in an application's resource consumption over time, which is referred to as "phase behavior".

In this context, a phase is defined as a set of intervals within an application’s

execution that have similar system-level resource consumption behavior, regardless of

temporal adjacency. It means that a phase may reappear many times as an application

executes. Phase classification partitions a set of intervals into phases with similar

behavior. In this chapter, we introduce an application resource demand phase analysis

and prediction prototype, which uses a combination of clustering and supervised learning

techniques to investigate the following questions:

1) Is there a phase behavior in the application’s resource consumption patterns? If so,

how many phases should be used to provide optimal resource provisioning?

2) Based on the observations of historical phase behaviors, what is the predicted next

phase of the application’s execution?

3) How do phase transition frequency and prediction accuracy affect resource allocation?

Answers to these questions can be used to decide the time and space allocation of resources.

To make optimization decisions, this prototype takes into account the application's resource consumption patterns, phase transition costs, and penalties associated with Service Level Agreement (SLA) violations.

The prediction accuracy is fed back to guide future phase analysis. This prototype does

not require any instrumentation of the application source code and can generally work with both physical and virtual machines that provide monitoring of system-level performance metrics.

Our experimental results with the CPU and the network performance traces of

SPECseis96 and WorldCup98 access log replay show that:


1. The total cost is a function of the number of phases. To best determine the

number of phases used for prediction it is necessary to account for the application’s

resource usage patterns, unit resource cost and unit resource re-provisioning cost

associated with the phase transitions, and the penalty associated with SLA violations

caused by mispredictions.

2. For applications with phase behavior, typically with a small number of phases the

savings gained from phase-based resource reservation can outweigh the costs associated with the increased number of re-provisioning operations and the penalties caused by mispredictions.

3. The phase prediction accuracy decreases as the number of phases increases. With the current prototype, an average phase prediction accuracy above 90% can be achieved for the CPU and network performance features when four phases are considered.

The rest of this chapter is organized as follows: Section 5.2 presents the application

phase analysis and prediction model. Sections 5.3 and 5.4 detail the algorithms used for

phase analysis and prediction. Section 5.5 presents experimental results. Section 5.6

discusses related work. Section 5.7 draws conclusions and discusses future work.

5.2 Application Resource Demand Phase Analysis and Prediction Prototype

Our application phase analysis and prediction prototype, illustrated in Figure

5-1, models how the application VM’s performance data are collected and analyzed to

construct the corresponding application’s phase profile and how the profile is used to

predict its next phase. In addition, it shows how process quality indicators, such as

phase prediction accuracy, are monitored and used as feedback signals to tune the system

performance (such as application response time) towards the goal defined in the SLA.

A performance monitoring agent is used to collect the performance data of the

application VM, which serves as the application container. The monitoring agent can

be implemented in various ways. In this work, Ganglia [54], a distributed monitoring

system, and the vmkusage tool [85] provided by VMware ESX server, are used to monitor

the application containers. The collected performance data are stored in the performance database.

Figure 5-1. Application resource demand phase analysis and prediction prototype. The phase analyzer analyzes the performance data collected by the monitoring agent to find out the optimal number of phases n ∈ [1, m]. The output phase profile is stored in the application phase database (DB) and will be used as training data for the phase predictor. The predictor predicts the next phase of the application resource usage based on the learning of its historical phase behaviors. The predicted phase can be used to support the application resource manager's (ARM's) decisions regarding resource provisioning. The auditor monitors and evaluates the performance of the analyzer and predictor and orders re-training of the phase predictor with the updated workload profile when the performance measurements drop below a predefined threshold. (VM: Virtual Machine; VMM: Virtual Machine Monitor; DB: Database; ARM: Application Resource Manager; CQ: Clustering Quality.)

The phase analyzer retrieves the time-series VM performance data, which are

identified by vmID, FeatureID, and a time window (ts, te), from the performance database.

Then it performs phase analysis using algorithms based on clustering to check whether

there is a phase behavior in the application’s resource consumption patterns. If so, it

continues to find out how many phases in a numeric range are best in terms of providing

the minimal resource reservation costs. The output phase profile, which consists of the


defined number of phases, corresponding cluster centroids and resource usage statistics

of each phase, is stored in the application phase database. The details of data clustering

algorithms are described in Section 5.3.

The phase profile is used as training data of the phase predictor. In the presence of

phase behavior, the phase predictor can perform on-line prediction of the next phase of the

application’s resource usage based on the learning of historical phase behaviors as shown

in Section 5.4. The predicted phase information can be used to support the application

resource manager ’s decisions regarding resource re-provisioning requests to the resource

scheduler.

The auditor monitors and evaluates the health of the phase analysis and prediction

process by performing quality control of each component. Clustering quality can be

measured by the similarity and compactness of the clusters using various internal indices

introduced in [96]. The phase predictor’s performance is measured by its prediction

accuracy. The application response time is used as an external signal for total quality

control and checked against the Quality of Service (QoS) defined in the SLA. Local

performance tuning is triggered when the auditor observes that the component-level

service quality drops below a predefined threshold. For example, when the real-time

workload varies to a degree which makes it statistically significantly different from the

training workload, the phase prediction accuracies may drop. Upon detection, the auditor

can order a phase analysis based on recent workload to update the phase profile and

subsequently order re-training of the phase predictor. If the re-training still cannot improve the total quality of service to a satisfactory level, the resource reservation strategy falls back from the phase-based reservation to a conservative strategy, which reserves the largest amount of resources the user is willing to pay for during the whole application run.

Automated and adaptive threshold setting is discussed in detail in [67].


5.3 Data Clustering

Clustering is an important data mining technique for discovering patterns in the

data. It has been used effectively in many disciplines such as pattern recognition, biology,

geology, and marketing.

At a high-level, the problem of clustering is defined as follows: Given a set U of

n samples u1, u2, · · · , un, we would like to partition U into k subsets U1, U2, · · · , Uk,

such that the samples assigned to each subset are more similar to each other than the

samples assigned to different subsets. Here, we assume that two samples are similar if they

correspond to the same phase.

5.3.1 Stages in Clustering

A typical pattern clustering activity involves the following steps [97]:

(1) Pattern representation, which is used to obtain an appropriate set of features to

use in clustering. It optionally consists of feature extraction and/or selection. Feature

selection is the process of identifying the most effective subset of the original features to

use in clustering. Feature extraction is the use of one or more transformations of the input

features to produce new salient features.

In the context of resource demand phase analysis, the features under study are the

system-level resource performance metrics shown in Table 5-1. For one-dimensional clustering, which is the case in this work, feature selection is as simple as choosing the performance metric that is instructive to the allocation of the corresponding system resource. For clustering based on multiple performance metrics, feature extraction techniques such as Principal Component Analysis (PCA) may be used to transform the input performance metrics into a lower-dimensional space to reduce the computational cost of

subsequent clustering and improve the clustering quality.

(2) Definition of a pattern proximity measure appropriate to the data domain. The

pattern proximity is usually measured by a distance function defined on pairs of patterns.

In this work, the most popular metric for continuous features, the Euclidean distance, is used to measure the dissimilarity between two patterns. It works well when a data set has

“compact” or “isolated” clusters. In case of clustering in the multi-dimensional space,

normalization of the continuous features can be used to remove the tendency of the

largest-scaled feature to dominate the others. In addition, Mahalanobis distance can be

used to remove the distortion caused by the linear correlation among features as discussed

in Chapter 3.

(3) Clustering or grouping: The clustering can be performed in a number of ways [97].

The output clustering can be hard (a partition of the data into groups) or fuzzy (where

each pattern has a variable degree of membership in each of the output clusters). A hard

clustering can be obtained from a fuzzy partition by thresholding the membership value.

In this work, one of the most popular iterative clustering methods, the k-means algorithm, as detailed in Section 5.3.3, is used.

5.3.2 Definitions and Notation

In this chapter, we follow the terms and notation defined in [97]:

–A pattern (or feature vector or observation) is a single data item used by the

clustering algorithm. It typically consists of a vector of d measurements.

–The individual scalar components of a pattern are called features (or attributes).

–d is the dimensionality of the pattern or of the pattern space.

–A class refers to a state of nature that governs the pattern generation process

in some cases. More concretely, a class can be viewed as a source of patterns whose

distribution in feature space is governed by a probability density specific to the class.

Clustering techniques attempt to group patterns so that the classes thereby obtained

reflect the different pattern generation processes represented in the pattern set.

–A distance measure is a metric on the feature space used to quantify the similarity of

patterns.


5.3.3 k-means Clustering

The k-means is one of the most popular clustering algorithms. It is intended for

situations in which all variables are of the quantitative type, and squared Euclidean

distance is chosen as the dissimilarity measure. The Euclidean distance between two

points xi and xj in a d-dimensional space is written as

d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2} = \| x_i - x_j \|_2 ,   (5–1)


The k-means algorithm works as follows [97]:

(1) Choose k cluster centers to coincide with k randomly-chosen patterns inside the

hypervolume containing the pattern set.

(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria

are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease

in squared error.
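A minimal MATLAB sketch of these four steps for a one-dimensional resource usage vector u is given below; the random initialization, the iteration cap, and the convergence test by unchanged assignments are illustrative choices, not the exact implementation.

% k-means clustering of a 1-D resource usage trace u (column vector) into k phases.
n = numel(u);
centers = u(randperm(n, k));                   % step 1: k randomly chosen patterns
assign = zeros(n, 1);
for iter = 1:100
    % step 2: assign each sample to the closest cluster center
    [~, newAssign] = min(abs(repmat(u, 1, k) - repmat(centers(:).', n, 1)), [], 2);
    if isequal(newAssign, assign), break; end  % step 4: stop when assignments stabilize
    assign = newAssign;
    for j = 1:k                                % step 3: recompute the cluster centers
        if any(assign == j), centers(j) = mean(u(assign == j)); end
    end
end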

The algorithm has a time complexity of O(n), where n is the number of patterns,

and a space complexity of O(k), where k is the number of clusters. The algorithm is

order-independent; for a given initial seed set of cluster centers, it generates the same

partition of the data irrespective of the order in which the patterns are presented to the

algorithm. However, the algorithm is sensitive to initial seed selection and even in the best

case, it can produce only hyperspherical clusters.


5.3.4 Finding the Optimal Number of Clusters

One of the most venerable problems in cluster analysis is to find the optimal number

of clusters in the data. Many statistical methods and computational algorithms have been

developed to answer this question using external indices and/or internal indices [96]. The

best number of clusters in the context of phase analysis discussed in this work is the one

that gives minimal total costs. The process to find out the optimal number of clusters of

the application workload is explained as follows.

Let un = u(t0 + n∆t) denote the resource usage sampled at time t = t0 + n∆t

during the execution of an application. As shown in Section 5.3.3, when the clustering

with input parameter k (i.e., the number of clusters) is performed for a resource usage set

U = {u1, u2, · · · }, the subset Ui of resource usages that belong to the ith phase can be

written as:

U_i = \{ u \mid u \in \text{phase } i \}, \quad 1 \le i \le k.   (5–2)

Resource reservation strategy: Phase-based resource reservation is performed. For intervals whose resource usages belong to the ith phase, the local maximum resource usage U_i^{max} of phase i is reserved:

U_i^{max} = \max ( u \mid u \in U_i ), \quad 1 \le i \le k,   (5–3)

and the total resource reservation R over the whole execution period can be written as

R(k) = \sum_{i=1}^{k} U_i^{max} \times (\text{size of } U_i),   (5–4)

where k is the number of clusters used for the clustering algorithm and the size of Ui is defined

as the number of elements of the subset Ui. Compared to the conservative reservation

strategy, which reserves the global maximum amount of resources over the whole execution

period, the phase-based reservation strategy can better adapt the resource reservation to

the actual resource usage and reduce the resource reservation cost as shown in Figure 5-2,

which illustrates the difference between the two reservation strategies using a hypothetical workload.

Figure 5-2. Resource allocation strategy comparison. The phase-based resource allocation strategy can adapt the time (∆t) and space (∆s) granularity of the allocation to the actual resource usage. It presents a cost reduction opportunity compared to the coarse-grained conservative strategy.

Phase transition cost: The second factor for determining the number of phases is

the transition cost caused by switching between different phases. Define the transition

cost TR(k) as the number of transitions among k phases. The total cost TC(k) can be

calculated from the resource reservation R(k) and phase transition TR(k) as

TC(k) = C_1 R(k) + C_2 TR(k),   (5–5)

where C_1 and C_2 denote the unit cost per resource usage and per transition, respectively. The best number of phases, k_best, should minimize the total cost. Therefore, k_best is derived as

k_{best} = \arg\min_{1 \le k \le K} TC(k) = \arg\min_{1 \le k \le K} \left[ R(k) + C \times TR(k) \right],   (5–6)


where C denotes the transition factor, which is the ratio of C2 to C1, and K is the

maximum number of phases.

Encoding misprediction penalty cost : The algorithm can be extended to phase

prediction as well as phase analysis of resource usage. The determination of the best

number of phases remains the same, whereas the cost function has to be changed to

take over- or under-provisioning caused by prediction error into account. Generally the

mispredictions consist of two possible cases: over-provisioning and under-provisioning.

Over-provisioning refers to cases in which the resource reservation based on the prediction is larger than the actual usage. It guarantees that the application response time is equal to or less than the time defined in the SLA. In this case, the penalty is the cost of the over-reserved resource, which is already encoded in the cost model. In the case of under-provisioning, the application's execution time will be lengthened because of the resource constraint. The performance degradation is approximated by the penalty in the

total cost function. The penalty is defined as the difference between the under-reserved

resource and the actual resource usage, and can be written as

U_i^{penalty} = \sum_{u \in U_i} \begin{cases} 0 & \text{if } u \le U_i^{max} \\ u - U_i^{max} & \text{if } u > U_i^{max} \end{cases}   (5–7)

P(k) = \sum_{i=1}^{k} U_i^{penalty},   (5–8)

where k is the number of phases. Taking both the phase transition and misprediction costs

into account, the general total cost function is modified as

TC'(k) = C_1 R(k) + C_2 TR(k) + C_3 P(k),   (5–9)


where C_1, C_2, and C_3 denote the unit costs per resource usage, transition, and penalty, respectively. Therefore, k'_best is derived as

k'_{best} = \arg\min_{1 \le k \le K} TC'(k) = \arg\min_{1 \le k \le K} \left[ R(k) + C \times TR(k) + C_p P(k) \right],   (5–10)

where C is the transition factor, Cp denotes the discount factor for the misprediction penalty, which is the ratio of C_3 to C_1, and K is the maximum number of phases.
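A minimal MATLAB sketch of this cost model is given below, assuming that the per-phase reservation levels Umax (a k x 1 vector learned from the training data) and the per-interval phase assignment assignPred (the predicted phase of each interval of the test trace u) are already available; the variable names are illustrative assumptions of this sketch.

% Total cost TC'(k) of phase-based reservation for a usage trace u, relative to
% the unit resource cost C1 (so C = C2/C1 and Cp = C3/C1 as in Equation 5-10).
resv = Umax(assignPred);                 % resource reserved in each interval
R  = sum(resv);                          % reservation cost, Eq. 5-4
TR = sum(diff(assignPred) ~= 0);         % number of phase transitions
P  = sum(max(u - resv, 0));              % under-provisioning penalty, Eqs. 5-7 and 5-8
TCprime = R + C * TR + Cp * P;           % Eq. 5-10
% k'_best is then the k in 1..K whose clustering minimizes TCprime.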

5.4 Phase Prediction

This section describes the workflow of the application resource demand phase

prediction illustrated in Figure 5-3. The prediction consists of two stages: a training

stage and a testing stage. During the training stage, the number of the clusters in

the application resource usage, the corresponding cluster centroids, and the unknown

parameters of the time series prediction model of the resource usage are determined.

During the testing stage, the one-step ahead resource usage is predicted and classified as

one of the clusters.

Both stages start from pattern representation and framing. In the step of pattern

representation, the collected performance data of the application VM are profiled to

extract only the features which will be used for clustering and future resource provisioning.

For example, in the one-dimensional case discussed in this thesis, the training data of a specific performance feature (X_{1×u}, see Table 5-1) are extracted, where u is the total number of input data points. Then the extracted performance data X_{1×u} are framed with the prediction window size m to form the data X'_{(u−m+1)×m}.

The training stage mainly consists of two processes: prediction model fitting and

phase behavior analysis. The algorithms defined in Sections 5.3.3 and 5.3.4 are used to

find out the number of phases k, which gives the lowest total resource provisioning cost.

The output phase profile is used to train the phase predictor. In addition, the unknown

parameters of the resource predictor are estimated from the training data. In this thesis,


a time-series prediction model, autoregressive (AR), is used for its simplicity and proven

success in computer system resource prediction [78]. However, this prototype can generally

work with any other time-series prediction models. In case of highly dynamic workloads,

the Learning-Aided Resource Predictor (LARPredictor) developed in Chapter 4 can be

used. The LARPredictor uses a mix-of-experts approach, which adaptively chooses the

best prediction model from a pool of models based on learning of the correlations between

the workload and fitted prediction models of historical runs.

Similar to the training stage, the testing data Y_{1×v} are extracted and framed with the prediction window size m. The framed testing data Y'_{(v−m+1)×m} are used as input to the fitted resource predictor to predict the future resource usage Y'_{1×v}. The phase predictor classifies the predicted resource usages Y'_{1×v} into the phases P'_{1×v} based on the phase profile learned in the training stage. Similarly, phase predictions for the actual resource usage Y_{1×v} are performed to generate P_{1×v}. Then the corresponding predicted phases P'_{1×v} (which are based on predicted resource usage) and P_{1×v} (which are based on actual resource usage) are compared to evaluate the phase prediction accuracy, which is defined as the ratio of the number of matched phase predictions to the total number of phase predictions.
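The testing stage can be sketched in MATLAB as follows, assuming the AR coefficients a (window size m) and the phase centroids c (a k x 1 vector) were obtained in the training stage; nearest-centroid classification stands in here for the phase predictor, and the variable names are illustrative.

% One-step-ahead phase prediction over a test trace y (column vector).
phasePred = zeros(numel(y), 1);               % phases of the predicted resource usage
phaseObs  = zeros(numel(y), 1);               % phases of the actual resource usage
for t = m+1:numel(y)
    h = y(t-m:t-1);                           % framed window of the last m samples
    yhat = h(:).' * a(:);                     % AR one-step-ahead resource prediction
    [~, phasePred(t)] = min(abs(c - yhat));   % classify the predicted usage
    [~, phaseObs(t)]  = min(abs(c - y(t)));   % classify the actual usage
end
accuracy = mean(phasePred(m+1:end) == phaseObs(m+1:end));  % phase prediction accuracy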

5.5 Empirical Evaluation

We have implemented a prototype for the phase analysis and prediction model

including Perl and Shell scripts to extract and profile the performance data from

the performance database, and a Matlab implementation of the phase analyzer and

predictor. This section shows the experimental results of the phase analysis and prediction

performance evaluations using traces collected from the batch executions of SPECseis96,

a scientific benchmark program, and replay of the WorldCup98 web access log. In all

the experiments, ten-fold cross validation was performed for each set of time series

performance data.


Table 5-1. Performance feature list

Performance Features    Description
CPU System / User       Percent CPU System / User
Bytes In / Out          Number of bytes per second into / out of the network
IO BI / BO              Blocks sent to / received from block device (blocks/s)
Swap In / Out           Amount of memory swapped in / out from / to disk (kB/s)

5.5.1 Phase Behavior Analysis

This set of experiments illustrates how the cost model presented in Section 5.3.4 can

be used to find out the best number of clusters for an application workload. The Ganglia

monitoring daemon was used to collect the performance data of the application container.

Table 5-1 shows the list of performance features under study in the experiments.

5.5.1.1 SPECseis96 benchmark

In this experiment, the SPECseis96 benchmark, which is a CPU-intensive workload

representing a scientific application [53], was hosted by a VMware GSX virtual machine.

The host server of the virtual machine was an Intel(R) Xeon(TM) dual-CPU 1.80GHz

machine with 512KB cache and 1GB RAM. The Ganglia daemon was installed in the

guest VM and run to collect the resource performance data once every five seconds

(5 secs/interval) and store them in the performance database. During feature representation, the data were extracted based on a given VMID, FeatureID, and starting and ending time stamps to form the time series data under study. Then subsequent phase analysis was

performed for the 8000 performance snapshots collected during the monitoring periods.

Figure A shows a sample set of training data of the CPU user (%) of the VM

including the actual resource usages (Actual Rsc), reserved resources based on the k-means

clustering (k=3) (Rsvd Rsc) and based on the conservative reservation strategy (Consrv

Rsc). Figure B shows a sample set of the corresponding testing data including the actual

resource usage (Actual Rsc), the resource reservation based on actual resource usage (Rsvd


Rsc (Actual)), the predicted resource usage by the AR prediction (Predicted Rsc), and the

resource reservation based on the predicted usage (Rsvd Rsc (Predict)).

Figures C and D show that, with an increasing number of phases, two of the determinants in the cost model, the number of phase transitions TR(k) and the misprediction penalty P(k), increase monotonically. The other determinant of the cost model, the amount of reserved resources R(k), is shown by the lowest curve (C = 0) in Figure E. It indicates that with an increasing number of phases the total reserved resources of the training set decrease monotonically. This is because, with an increasing number of phases, the resource allocation can be performed at time scales of finer granularity. However, there are diminishing returns from increasing the number of phases because of the increasing phase transition costs and misprediction penalties.

In the first analysis, we assume each resource reservation scheme to be clairvoyant,

i.e., it reserves resources based on exact knowledge of future workload requirements. This

assumption eliminates the impact of inaccuracies introduced by the phase predictor.

In this case, Equation (5–6), which takes the resource reservation cost and the phase

transition cost into account while deciding the optimal number of phases, can be

applied as shown in Figure E. In this figure, the total cost over the whole testing period

is measured by CPU usage in percentage. The discount factor C denotes the CPU percentage consumed by each phase transition: C = CPU(%) × TransitionDuration. For example, the bottom curve with C = 0 shows the case of no transition cost, which gives the lower bound of the total cost. As another instance, C = 260% implies a 13-second transition period (2.6 intervals × 5 secs/interval) under the assumption of 100% CPU consumption during the transition period. When the discount factor C increases from 0 to 260, the best number of phases k_best, which provides the lowest total cost, decreases

gradually from 10 to 2. The phase profile depicted in Figure E can be used to decide the

number of phases that should be used in the phase-based resource reservation to minimize

the total cost with given available transition options. For example, VMware ESX supports


on-line resource reprovisioning on the same cluster node. So the transition time can be

virtually close to zero (C = 0). In this case, 10 phases can be used. If the transition takes

8 seconds (C = 156), which is achievable with intra-cluster VM migration for resource

reprovisioning, four phases work the best. When the transition cost exceeds the level that

the reduced resource reservation can justify for the workload under study, the total cost is

an increasing function of the number of phases. In this case, it is better to fall back from

the phase-based resource reservation strategy to the conservative one.

The impact of inaccuracies introduced by the phase predictor is shown in Figure F. In

addition to the resource reservation costs and the phase transition costs, this experiment

also took the phase misprediction penalty costs into account while calculating the total cost. For example, for each unit of resource that is mispredicted downward (under-provisioned), a penalty of 8 times (Cp = 8) the unit resource cost is imposed. Comparing Figure E to Figure F, we can see that adding the penalty into the cost model increases the final costs to the user for the same set of k and C and potentially reduces the workload's best number of phases k'_best for the same set of C and Cp.

Finally, a total cost ratio ρ is defined as the ratio of the total cost using k phases, TC'(k), to the total cost using one phase, TC'(1):

ρ = TC'(k) / TC'(1).   (5–11)

Intuitively, ρ measures the cost savings achieved using the phase-based reservation strategy over the conservative one. Thus, the smaller the value of ρ, the more efficient the phase-based reservation scheme. Table 5-2 gives a sample total cost schedule (C = 52

and Cp = 8) for each of the eight performance features of SPECseis96. It shows that

by changing the resource provisioning strategy from the conservative approach (k = 1)

to the phase-based provisioning (k = 3), 29.5% total cost reduction for CPU usage can

be achieved. For spiky trace data such as disk I/O and memory usage, the total cost

reduction can be as high as 49%.

Table 5-2. SPECseis96 total cost ratio schedule for the eight performance features
(total cost ratio ρ = TC'(k)/TC'(1), where C = 52 and Cp = 8)

Performance      Number of phases (k)
Features         1     2     3     4     5     6     7     8     9     10
CPU_user         1.00  0.80  0.75  0.75  0.75  0.77  0.78  0.78  0.80  0.83
CPU_system       1.00  0.67  0.66  0.65  0.64  0.66  0.67  0.69  0.70  0.71
Bytes_in         1.00  0.97  0.96  0.96  0.96  0.96  0.96  0.95  0.95  0.95
Bytes_out        1.00  0.95  0.90  0.88  0.90  0.87  0.87  0.87  0.87  0.87
IO_BI            1.00  0.57  0.52  0.55  0.56  0.58  0.62  0.63  0.62  0.64
IO_BO            1.00  0.57  0.53  0.55  0.57  0.61  0.60  0.61  0.64  0.63
Swap_in          1.00  0.54  0.55  0.59  0.59  0.60  0.61  0.63  0.64  0.65
Swap_out         1.00  0.51  0.47  0.49  0.54  0.55  0.57  0.58  0.59  0.61

5.5.1.2 World Cup web log replay

In this experiment, phase characterization was performed for the performance data

collected from a network-intensive application, 1998 World Cup web access log replay.

The workload used in this experiment was based on the 1998 World Cup trace [98].

The openly available trace containing a log of requests to Web servers was used as an

input to a client replay tool, which enabled us to exercise a realistic Web-based workload

and collect system-level performance metrics using Ganglia in the same manner that was

done for the SPECseis96 workload. For this study, we chose to replay the five-hour (from

22:00:01 Jun.23 to 3:11:20 Jun.24) log of the least loaded server (serverID 101), which

contained 130,000 web requests.

The phase analysis and prediction techniques can be used to characterize performance

data collected from not only virtual machines but also physical machines. During the

experiment, a physical server with sixteen Intel(R) Xeon(TM) MP 3.00GHz CPUs

and 32GB memory was used to execute the replay clients to submit requests based on

submission intervals, HTTP protocol types (1.0 or 1.1), and document sizes defined in

the log file. A physical machine with Intel(R) Pentium(R) 4 1.70GHz CPU and 512MB

memory was used to host the Apache web server and a set of files which were created

based on the file sizes described in the log.


To perform the web log replay, a Matlab program was developed to profile the

binary access log file and extract the entries of the target web server. The “recreate”

tool provided by [98] was used to convert the binary log into the Common Log Format.

A modified version of the Real-Time Web Log Replayer [99] was used to analyze and

generate the files needed by the log replayer and perform the replay.

Figures 5-5 and 5-6 show the phase characterization results of the performance

features bytes in and bytes out of the web server. The interesting observation from Figures

A and B is that the number of phase transitions and mis-prediction penalties do not

always monotonically increase with the increasing number of phases. As a result, the

phase profile shown in Figure C argues that three-phase based resource provisioning gives the lowest total cost for C = [150k, 750k] and Cp = 8. The results imply that the phase profile is highly workload dependent. The prototype presented in this thesis can

help to construct and analyze the phase profile of the application resource consumption

and decide the proper resource provisioning strategy.

5.5.2 Phase Prediction Accuracy

As one of the cost determinants, the misprediction penalty is a function of the phase prediction accuracy. This section evaluates the performance of the phase prediction model introduced in Section 5.4. A performance measurement, the prediction accuracy, is defined as the ratio of the number of performance snapshots whose predicted phases match the observed phases to the total number of performance snapshots collected during the testing period.

Table 5-3 shows the phase prediction accuracies for the performance traces of the

main resources consumed by the SPECseis96 and the WorldCup98 workloads. Generally,

the phase prediction accuracy of each performance feature decreases with increasing

number of phases. It explains why the penalty curve rises monotonically with the

increasing number of phases in Figure D. With current implementation, an average of

95% accuracy can be achieved for the network performance traces of the WorldCup98 log

123

Page 124: LEARNING-AIDED SYSTEM PERFORMANCE MODELING IN …ufdcimages.uflib.ufl.edu/UF/E0/02/17/38/00001/zhang_j.pdfLEARNING-AIDED SYSTEM PERFORMANCE MODELING IN SUPPORT OF SELF-OPTIMIZED RESOURCE

Table 5-3. Average phase prediction accuracy

1 2 3 4 5 6 7 8 9 10Bytes_in 1.00 0.99 0.99 0.98 0.98 0.97 0.97 0.96 0.96 0.96Bytes_out 1.00 0.94 0.94 0.92 0.91 0.89 0.87 0.88 0.86 0.84CPU_user 1.00 0.95 0.90 0.87 0.85 0.81 0.78 0.77 0.74 0.69CPU_system 1.00 0.94 0.87 0.83 0.83 0.79 0.76 0.74 0.73 0.69

WorldCup98

SPECseis96

PerformanceFeatures

Number of phases (k)Application

Table 5-4. Performance feature list of VM traces

Perf. feature   Description
CPU Ready       The percentage of time that the virtual machine was ready but could not get scheduled to run on a physical CPU.
CPU Used        The percentage of physical CPU resources used by a virtual CPU.
Mem Size        Current amount of memory in bytes the virtual machine has.
Mem Swap        Amount of swap space in bytes used by the virtual machine.
Net RX/TX       The number of packets and the MBytes per second that are transmitted and received by a NIC.
Disk RD/WR      The number of I/Os and KBytes per second that are read from and written to the disk.

Similarly, an average accuracy of 85% is achieved for the CPU performance traces of SPECseis96 in the four-phase cases.

In addition to the above two applications, we also evaluated the prediction performance of the phase predictor using traces of a set of five virtual machines. These

virtual machines were hosted by a physical machine with an Intel(R) Xeon(TM) 2.0GHz

CPU, 4GB memory, and 36GB SCSI disk. VMware ESX server 2.5.2 was running on

the physical host. The vmkusage tool was run on the ESX server to collect the resource

performance data of the guest virtual machines every minute and store them in a round

robin database. The performance features under study in this experiment are shown in

Table 5-4.


In this experiment, a virtual machine (VM1) hosting a web server, Globus GRAM/MDS and GridFTP services, and a PBS head node was used. Its trace data over a 7-day period with 30-minute intervals were extracted. During this period, a total of 310 jobs were executed, with a mix of 93.55% short-running jobs (1 to 2 seconds), 3.87% medium-running jobs (2 to 10 minutes), and 2.58% long-running jobs (45 to 50 minutes). In addition to VM1, 24-hour performance traces with 5-minute intervals of four additional virtual machines were evaluated as well: VM2, which hosts a Linux-based port-forwarding proxy for VNC sessions; VM3, which hosts a Windows XP based calendar; VM4, which hosts a web server, a list server, and a Wiki server; and VM5, which hosts a web server.

Table 5-5 shows the average phase prediction accuracies for each of the 12 performance features over all five VMs. With an increasing number of phases, the phase prediction accuracy of each performance feature decreases monotonically. The prediction accuracies vary with the performance feature under study. With the current implementation, an average accuracy of 83.25% is achieved across the phase predictions of all twelve performance features for the two-phase cases.

5.5.3 Discussion

In the phase analysis and prediction experiments, the following assumptions regarding the components of the cost model are made:

1. A clear mapping between resource consumption and response time is assumed for the application container. This might not hold for all types of applications; more complex performance or queuing models may be needed to provide an accurate mapping for complex applications.

2. A dedicated machine is assumed for the application container when collecting the performance data. If multiple applications co-exist on the same hosting machine, a more sophisticated data collection method, for example aggregating the performance data of the processes that belong to the same application, may be needed.


Table 5-5. Average phase prediction accuracy of the five VMs

Performance    Number of phases (k)
feature        1     2     3     4     5     6     7     8     9     10
CPU_Used       1.00  0.85  0.69  0.60  0.51  0.48  0.43  0.44  0.38  0.35
CPU_Ready      1.00  0.81  0.67  0.52  0.45  0.36  0.36  0.32  0.33  0.32
Mem_Size       1.00  0.91  0.84  0.71  0.70  0.59  0.57  0.52  0.50  0.48
Mem_Swap       1.00  0.96  0.89  0.89  0.83  0.75  0.71  0.70  0.66  0.64
NIC #1_RX      1.00  0.58  0.54  0.47  0.41  0.39  0.37  0.34  0.30  0.28
NIC #1_TX      1.00  0.56  0.48  0.42  0.39  0.35  0.29  0.26  0.29  0.25
NIC #2_RX      1.00  0.93  0.77  0.70  0.61  0.55  0.46  0.33  0.31  0.24
NIC #2_TX      1.00  0.88  0.81  0.76  0.71  0.63  0.53  0.48  0.56  0.45
Disk1_Read     1.00  0.97  0.92  0.86  0.80  0.73  0.64  0.56  0.52  0.44
Disk1_Write    1.00  0.94  0.87  0.78  0.70  0.67  0.63  0.59  0.58  0.55
Disk2_Read     1.00  0.67  0.61  0.55  0.50  0.49  0.47  0.46  0.41  0.38
Disk2_Write    1.00  0.93  0.84  0.76  0.60  0.57  0.51  0.46  0.41  0.38

3. In this work, one-dimensional phase analysis and prediction is performed. However, the prototype can also be applied to multi-dimensional resource provisioning. For clustering in a multi-dimensional space, additional pattern representation techniques such as Principal Component Analysis (PCA) can be used to project the data onto a lower-dimensional space and reduce the computational cost, as illustrated in the sketch below. In that case, the transition factor C represents the unit transition cost defined in the pricing schedule of the resource provider.
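As a rough illustration of that extension (not the exact implementation, which is in Matlab), the sketch below projects multi-dimensional resource-usage snapshots onto a few principal components before clustering them into phases with k-means; the scikit-learn classes and parameter values are placeholders:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def multidim_phase_profile(samples, n_phases, n_components=2):
        """Cluster multi-dimensional resource-usage samples into phases.

        samples : array of shape (n_snapshots, n_features), e.g. CPU,
                  memory, network, and disk metrics per snapshot.
        Returns the phase label of every snapshot and the cluster
        centroids in the reduced space.
        """
        reduced = PCA(n_components=n_components).fit_transform(samples)
        km = KMeans(n_clusters=n_phases, n_init=10, random_state=0).fit(reduced)
        return km.labels_, km.cluster_centers_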

Developing prediction models for parallel and multi-tier applications is part of our

future research.

5.6 Related Work

Recently, application phase behavior has drawn growing research interest for different reasons. First, tracking application phases enables workload-dependent dynamic management of power/performance trade-offs [100][101]. Second, phase characterization that summarizes application behavior with representative execution regions can be used


to reduce the high computation costs of large-scale simulations [102][103]. Our purpose in studying phase behavior is to support dynamic resource provisioning of application containers.

In addition to the purpose of the study, our approach differs from traditional program phase analysis in the following ways:

1) Performance metrics under study: In power management and simulation optimization for computer architecture research, the metrics used for workload characterization are typically Basic Block Vectors (BBV) [102][101], conditional branch counters [104], and instruction working sets [105]. In the context of application VM/container resource provisioning, the metrics under study are system-level performance features that are instructive for VM resource provisioning, such as those shown in Table 5-1.

2) Knowledge of the program code: While [102][101][104] require at least profiling of the program binaries, our approach requires neither instrumentation nor access to program code.

3) This thesis answers the question “how many clusters are best” in the context of system-level resource provisioning.

In [106], Dhodapkar et al. compared the three dynamic program phase detection techniques discussed in [102], [104], and [105] using a variety of metrics, such as sensitivity, stability, performance variance, and the correlations between the phase detection techniques.

In addition, other related work on resource provisioning includes the following. Urgaonkar et al. studied resource provisioning in a multi-tier web environment [107]. Wildstrom et al. developed a method to identify the best CPU and memory configuration from a pool of configurations for a specific workload [108]. Chase et al. proposed a hierarchical architecture that allocates virtual clusters to a group of applications [109]. Kusic et al. developed an optimization framework to decide the number of servers to allocate to


each cluster to maximize system revenue [110]. Tesauro et al. used a combination of reinforcement learning and a queuing model for system performance management [5].

5.7 Conclusion

The application resource demand phase analysis and prediction prototype presented in this chapter shows how to apply statistical learning techniques to support on-demand resource provisioning. This chapter defines phases in the context of system-level resource provisioning and provides an approach to automatically determine the number of phases that yields the optimal cost. The proposed cost model takes the resource cost, the phase transition cost, and the prediction accuracy into account. The experimental results show that an average phase prediction accuracy above 90% is achieved across the CPU and network performance features under study for the four-phase cases. With knowledge of the system-level application phase behavior, we envision that resource scheduling can be dynamically optimized during the application run to improve system utilization and reduce the cost for the user. Providing more informative phase prediction can help to achieve this goal and is part of our future research.


[Figure 5-3 is a workflow diagram: training data are framed and used for k-means model fitting of the resource predictor; the resulting phase profile drives the phase predictor, whose output is compared against observed phases to measure phase prediction accuracy.]

Figure 5-3. Application resource demand phase prediction workflow. In the training stage, the u performance data samples X_{1×u} of the feature(s) used in the subsequent phase analysis are extracted (pattern representation) and framed with prediction window size m. The unknown parameters of the resource predictor are estimated during model fitting using the framed training data X'_{(u−m+1)×m}. In addition, the clustering algorithms introduced in Section 5.3 are used to construct the application phase profile, including the phase labels I_{1×u} for all the samples and the calculated cluster centroids C_{1×k}. In the testing stage, the phase predictor uses the knowledge learned from the phase profile to predict the future phases P'_{1×v} based on the predicted resource usage Y'_{1×v} and P_{1×v} based on the observed actual resource usage Y_{1×v}, and the two are compared to evaluate the phase prediction accuracy.
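The framing step described in the caption can be sketched as follows; this is a minimal NumPy illustration with an illustrative function name, not the dissertation's Matlab code:

    import numpy as np

    def frame_series(x, m):
        """Frame a 1-D performance trace x of length u into overlapping
        windows of size m, yielding a (u - m + 1) x m matrix as described
        in the Figure 5-3 caption."""
        x = np.asarray(x)
        u = x.shape[0]
        return np.stack([x[i:i + m] for i in range(u - m + 1)])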


[Figure 5-4, panels A and B: percent CPU user versus time index for the sample training and testing data, showing conservative, reserved, predicted, and actual resource levels.]

Figure 5-4. Phase analysis of SPECseis96 CPU user. A) Sample training data. B) Sample testing data. C) Phase transitions. D) Misprediction penalties. E) Total cost without penalty. F) Total cost with penalty (Cp = 8).


[Figure 5-4, continued, panels C and D: number of transitions TR(k) and misprediction penalty P(k) versus number of phases.]

Figure 5-4. Continued.


[Figure 5-4, continued, panels E and F: total cost TC(k) = reserved resource R(k) + C * transitions TR(k) versus number of phases, plotted for C = 0, 52, 104, 156, 208, and 260, without and with the misprediction penalty.]

Figure 5-4. Continued.


[Figure 5-5, panels A-C: number of transitions TR(k), misprediction penalty P(k), and total cost TC'(k) = TC(k) + 8 * P(k) versus number of phases, plotted for C = 0, 33,000, 66,000, 99,000, 132,000, and 165,000.]

Figure 5-5. Phase analysis of WorldCup’98 Bytes In. A) Phase transitions. B) Misprediction penalties. C) Total cost with penalty (Cp = 8).


[Figure 5-6, panels A-C: number of transitions TR(k), misprediction penalty P(k), and total cost TC'(k) = TC(k) + 8 * P(k) versus number of phases, plotted for C = 0, 150,000, 300,000, 450,000, 600,000, and 750,000.]

Figure 5-6. Phase analysis of WorldCup’98 Bytes Out. A) Phase transitions. B) Misprediction penalties. C) Total cost with penalty (Cp = 8).


CHAPTER 6
CONCLUSION

Self-management has drawn increasing attention in the last few years due to the growing size and complexity of computing systems. A resource scheduler that can perform self-optimization and self-configuration can help improve system throughput and free system administrators from labor-intensive and error-prone tasks. However, it is challenging to equip a resource scheduler with such self-management capabilities because of the dynamic nature of system performance and workloads.

In this dissertation, we propose to use machine learning techniques to assist system

performance modeling and application workload characterization, which can provide

support for on-demand resource scheduling. In addition, virtual machines are used

as resource containers to host application executions for the ease of dynamic resource

provisioning and load balancing.

The application classification framework presented in Chapter 2 uses Principal Component Analysis (PCA) to reduce the dimension of the performance data space. The k-Nearest Neighbor (k-NN) algorithm is then used to classify the data into classes such as CPU-intensive, I/O-intensive, memory-intensive, and network-intensive. The framework does not require modifications to the application source code. Experiments with various benchmark applications suggest that, with the application class knowledge, a scheduler can improve system throughput by 22.11% on average by allocating applications of different classes to share the system resources.
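A minimal sketch of such a PCA-plus-k-NN classification pipeline is shown below; it uses scikit-learn, and the component and neighbor counts are illustrative placeholders rather than the values used in Chapter 2:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # X_train: performance snapshots (n_samples x n_metrics); y_train: labels
    # such as "cpu-intensive", "io-intensive", "memory-intensive", "network-intensive".
    def build_classifier(X_train, y_train, n_components=5, n_neighbors=5):
        """PCA for dimension reduction followed by k-NN classification."""
        clf = make_pipeline(PCA(n_components=n_components),
                            KNeighborsClassifier(n_neighbors=n_neighbors))
        return clf.fit(X_train, y_train)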

The feature selection prototype presented in Chapter 3 uses a probabilistic model (a Bayesian network) to systematically select representative performance features that provide optimal classification accuracy and adapt to changing workloads. It shows that autonomic feature selection enables classification without requiring expert knowledge in the selection of relevant low-level performance metrics. This approach requires neither application source code modification nor execution intervention. Results from


experiments show that the proposed scheme can effectively select a performance metric

subset providing above 90% classification accuracy for a set of benchmark applications.
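Chapter 3's selector is built on a Bayesian network; as a deliberately simpler stand-in (not the Chapter 3 algorithm), the sketch below ranks low-level metrics by mutual information with the application class label and keeps the highest-scoring subset. All names and parameters are illustrative:

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def rank_metrics(X, y, metric_names, top_k=10):
        """Rank low-level performance metrics by mutual information with
        the application class label and return the top_k metric names."""
        scores = mutual_info_classif(X, y, random_state=0)
        order = np.argsort(scores)[::-1]
        return [metric_names[i] for i in order[:top_k]]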

In addition to the application resource demand modeling, Chapter 4 proposes a learning-based adaptive predictor, which can be used to predict resource availability. It uses the k-NN classifier and PCA to learn the relationship between workload characteristics and the best-suited predictor based on historical predictions, and to forecast the best predictor for the workload under study. Then, only the selected best predictor is run to predict the next value of the performance metric, instead of running multiple predictors in parallel to identify the best one. The experimental results show that this learning-aided adaptive resource predictor can often outperform the single best predictor in the pool without a priori knowledge of which model best fits the data.
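A minimal sketch of this selection idea follows; the predictor pool, window representation, and parameter values are illustrative assumptions, not the Chapter 4 implementation:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier

    class AdaptivePredictorSelector:
        """Pick the predictor best suited to the current workload window,
        then run only that predictor for the next-value prediction."""

        def __init__(self, predictors, n_components=3, n_neighbors=5):
            # predictors: mapping from label to a callable taking a window,
            # e.g. {"last": lambda w: w[-1], "mean": lambda w: float(np.mean(w))}
            self.predictors = predictors
            self.pca = PCA(n_components=n_components)
            self.knn = KNeighborsClassifier(n_neighbors=n_neighbors)

        def fit(self, windows, best_labels):
            # windows: (n_samples, window_len) history; best_labels: which
            # predictor performed best on each historical window.
            self.knn.fit(self.pca.fit_transform(windows), best_labels)
            return self

        def predict_next(self, window):
            label = self.knn.predict(self.pca.transform([window]))[0]
            return self.predictors[label](window)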

The application classification and the feature selection techniques can be used

to define the application resource consumption patterns at any given moment. The

experimental results of the application classification suggest that allocating applications

which have complementary resource consumption patterns to the same server can improve

the system throughput.

In addition to one-step-ahead performance prediction, Chapter 5 studied the large-scale behavior of application resource consumption. Clustering-based algorithms were explored to provide a mechanism for defining and predicting the phase behavior of application resource usage in support of on-demand resource allocation. The experimental results show that an average phase prediction accuracy above 90% is achieved for the four-phase cases of the benchmark workloads.
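A compact sketch of the phase-profile idea follows: k-means builds the phase profile from historical one-dimensional usage samples, and a predicted usage value is mapped to the phase whose centroid it is closest to. Names and parameters are illustrative, not the dissertation's Matlab implementation:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_phase_profile(training_usage, n_phases):
        """Cluster historical resource-usage samples into n_phases phases."""
        usage = np.asarray(training_usage, dtype=float).reshape(-1, 1)
        return KMeans(n_clusters=n_phases, n_init=10, random_state=0).fit(usage)

    def predict_phases(phase_profile, predicted_usage):
        """Assign predicted usage values to the nearest phase centroid."""
        usage = np.asarray(predicted_usage, dtype=float).reshape(-1, 1)
        return phase_profile.predict(usage)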


REFERENCES

[1] J. Kephart and D. Chess, “The vision of autonomic computing,” Computer, vol. 36,no. 1, pp. 41–50, 2003.

[2] Y. Yang and H. Casanova, “Rumr: Robust scheduling for divisible workloads.,” inProc. 12th High-Performance Distributed Computing, Seattle, WA, June 22-24, 2003,pp. 114–125.

[3] J. M. Schopf and F. Berman, “Stochastic scheduling,” in Proc. ACM/IEEEConference on Supercomputing, Portland, OR, Nov. 14–19, 1999, p. 48.

[4] L. Yang, J. M. Schopf, and I. Foster, “Conservative scheduling: Using predictedvariance to improve scheduling decisions in dynamic environments,” in Proc.ACM/IEEE conference on Supercomputing, Nov. 15-21, 2003, p. 31.

[5] G. Tesauro, N. Jong, R. Das, and M. Bennani, “A hybrid reinforcement learningapproach to autonomic resource allocation,” in Proc. IEEE International Conferenceon Autonomic Computing (ICAC’06), 2006, pp. 65–73.

[6] G. Tesauro, R. Das, W. Walsh, and J. Kephart, “Utility-function-driven resourceallocation in autonomic systems,” in Proc. Second International Conference onAutonomic Computing (ICAC’05), 2005, pp. 342–343.

[7] R. Duda, P. Hart, and D. Stork, The Art of Computer Systems PerformanceAnalysis: Techniques for Experimental Design, Measurement, Simulation, andModeling, Wiley-Interscience, New York, NY, Apr. 1991.

[8] J. O. Kephart, “Research challenges of autonomic computing,” in Proc. 27thInternational Conference on Software Engineering ICSE, May 2005, pp. 15–22.

[9] S. M. Weiss and C. A. Kulikowski, Computer Systems That Learn: Classificationand Prediction Methods from Statistics, Neural Nets, Machine Learning, and ExpertSystems, Morgan Kaufmann, San Mateo, CA 94403, 1990.

[10] R. P. Goldberg, “Survey of virtual machine research,” IEEE Computer Magazine,vol. 7, no. 6, pp. 34–45, June 1974.

[11] R. Figueiredo, P. Dinda, and J. Fortes, “A case for grid computing on virtualmachines,” in Proc. 23rd International Conference on Distributed ComputingSystems, May 19–22, 2003, pp. 550–559.

[12] S. Pinter, Y. Aridor, S. Shultz, and S. Guenender, “Improving machine virtualizationwith ’hotplug memory’,” Proc. 17th International Symposium on ComputerArchitecture and High Performance Computing, pp. 168–175, 2005.

[13] C. Clark, K. Fraser, S. Hand, J. Hanseny, E. July, C. Limpach, I. Pratt, andA. Warfield, “Live migration of virtual machines,” in Proc. 2nd Symposium onNetworked Systems Design & Implementation (NSDI’05), Boston, MA, 2005.


[14] “Vmotion,” http://www.vmware.com/products/vi/vc/vmotion.html.

[15] M. Zhao, J. Zhang, and R. Figueiredo, “Distributed file system support for virtualmachines in grid computing,” Proc. 13th International Symposium on High Perfor-mance Distributed Computing, pp. 202–211, 2004.

[16] I. Krsul, A. Ganguly, J. Zhang, J. Fortes, and R. Figueiredo, “Vmplants: Providingand managing virtual machine execution environments for grid computing,” in Proc.Supercomputing, Washington, DC, Nov. 6–12, 2004.

[17] J. Sugerman, G. Venkitachalan, and B. Lim, “Virtualizing i/o devices on vmwareworkstation’s hosted virtual machine monitor,” in Proc. USENIX Annual TechnicalConference, 2001.

[18] J. Dike, “A user-mode port of the linux kernel,” in Proc. 4th Annual Linux Showcaseand Conference, USENIX Association, Atlanta, GA, Oct. 2000.

[19] A. Sundararaj and P. Dinda, “Towards virtual networks for virtual machine gridcomputing,” in Proc. 3rd USENIX Virtual Machine Research and TechnologySymposium, May 2004.

[20] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, “Checkpoint andmigration of UNIX processes in the Condor distributed processing system,” Tech.Rep. UW-CS-TR-1346, University of Wisconsin - Madison Computer SciencesDepartment, Apr. 1997.

[21] A. Barak, O. Laden, and Y. Yarom, “The now mosix and its preemptive processmigration scheme,” Bulletin of the IEEE Technical Committee on Operating Systemsand Application Environments, vol. 7, no. 2, pp. 5–11, 1995.

[22] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley-Interscience, NewYork, NY, 2001, 2nd edition.

[23] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning,” ArtificialIntellegence Review, vol. 11, no. 1-5, pp. 11–73, 1997.

[24] S. Adabala, V. Chadha, P. Chawla, R. J. O. Figueiredo, J. A. B. Fortes, I. Krsul,A. M. Matsunaga, M. O. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, “Fromvirtualized resources to virtual computing grids: the in-vigo system.,” FutureGeneration Comp. Syst., vol. 21, no. 6, pp. 896–909, 2005.

[25] L. Yu and H. Liu, “Efficient feature selection via analysis of relevance andredundancy,” Journal of Machine Learning Research, vol. 5, pp. 1205–1224,Oct. 2004.

[26] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf.Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.


[27] M. L. Massie, B. N. Chun, and D. E. Culler, “The ganglia distributed monitoringsystem: Design, implementation, and experience.,” Parallel Computing, vol. 30, no.5-6, pp. 817–840, 2004.

[28] “Netapp,” http://www.netapp.com/tech library/3022.html.

[29] R. Eigenmann and S. Hassanzadeh, “Benchmarking with real industrial applications:the spec high-performance group,” IEEE Computational Science and Engineering,vol. 3, no. 1, pp. 18–23, 1996.

[30] “Ettcp,” http://sourceforge.net/projects/ettcp/.

[31] “Simplescalar,” http://www.cs.wisc.edu/ mscalar/simplescalar.html.

[32] “Ch3d,” http://users.coastal.ufl.edu/ pete/CH3D/ch3d.html.

[33] “Bonnie,” http://www.textuality.com/bonnie/.

[34] Q. Snell, A. Mikler, and J. Gustafson, “Netpipe: A network protocol independent performance evaluator,” June 1996.

[35] “Vmd,” http://www.ks.uiuc.edu/Research/vmd/.

[36] “Spim,” http://www.cs.wisc.edu/ larus/spim.html.

[37] “Reference of stream,” http://www.cs.virginia.edu/stream/ref.html.

[38] “Autobench,” http://www.xenoclast.org/autobench/.

[39] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J.Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.

[40] Y. Liao and V. R. Vemuri, “Using text categorization techniques for intrusiondetection,” in 11th USENIX Security Symposium, San Francisco, CA, Aug. 5–9,2002, pp. 51–59.

[41] A. K. Ghosh, A. Schwartzbard, and M. Schatz, “Learning program behavior profilesfor intrusion detection,” in Proc. the Workshop on Intrusion Detection and NetworkMonitoring, Santa Clara, CA, Apr. 9–12, 1999, pp. 51–62.

[42] M. Almgren and E. Jonsson, “Using active learning in intrusion detection,” in Proc.17th IEEE Computer Security Foundations Workshop, June 28–30, 2004, pp. 88–98.

[43] S. C. Lee and D. V. Heinbuch, “Training a neural-network based intrusion detectorto recognize novel attacks.,” IEEE Transactions on Systems, Man, and Cybernetics,Part A, vol. 31, no. 4, pp. 294–299, 2001.

[44] G. Forman, “An extensive empirical study of feature selection metrics for textclassification,” J. Mach. Learn. Res., vol. 3, pp. 1289–1305, 2003.


[45] N. H. Kapadia, J. A. B. Fortes, and C. E. Brodley, “Predictive application-performance modeling in a computational grid environment,” in Proc. 8th IEEEInternational Symposium on High Performance Distributed Computing, RedondoBeach, CA, Aug. 3–6, 1999, p. 6.

[46] J. Basney and M. Livny, “Improving goodput by coscheduling cpu and networkcapacity,” Int. J. High Perform. Comput. Appl., vol. 13, no. 3, pp. 220–230, Aug.1999.

[47] R. Raman, M. Livny, and M. Solomon, “Policy driven heterogeneous resourceco-allocation with gangmatching,” in Proc. 12th IEEE International Symposiumon High Performance Distributed Computing (HPDC’03), Seattle, WA, June 22–24,2003, p. 80.

[48] S. Sodhi and J. Subhlok, “Skeleton based performance prediction on sharednetworks,” in IEEE International Symposium on Cluster Computing and the Grid(CCGrid 2004), 2004, pp. 723– 730.

[49] V. Taylor, X. Wu, and R. Stevens, “Prophesy: an infrastructure for performanceanalysis and modeling of parallel and grid applications,” SIGMETRICS Perform.Eval. Rev., vol. 30, no. 4, pp. 13–18, 2003.

[50] O. Y. Nickolayev, P. C. Roth, and D. A. Reed, “Real-time statistical clustering forevent trace reduction,” The International Journal of Supercomputer Applicationsand High Performance Computing, vol. 11, no. 2, pp. 144–159, Summer 1997.

[51] D. H. Ahn and J. S. Vetter, “Scalable analysis techniques for microprocessorperformance counter metrics,” in Proc. SuperComputing, Baltimore, MD, Nov.16–22, 2002, pp. 1–16.

[52] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons, “Correlatinginstrumentation data to system states: A building block for automated diagnosisand control.,” in 6th USENIX Symposium on Operating Systems Design andImplementation, 2004, pp. 231–244.

[53] J. Zhang and R. Figueiredo, “Application classification through monitoring andlearning of resource consumption patterns,” in Proc. 20th International Parallel &Distributed Processing Symposium, Rhodes Island, Greece, Apr. 25–29, 2006.

[54] M. Massie, B. Chun, and D. Culler, The Ganglia Distributed Monitoring System:Design, Implementation, and Experience, Addison-Wesley, Reading, MA, 2003.

[55] S. Agarwala, C. Poellabauer, J. Kong, K. Schwan, and M. Wolf, “Resource-awarestream management with the customizable dproc distributed monitoringmechanisms,” in Proc. 12th IEEE International Symposium on High PerformanceDistributed Computing, June 22–24, 2003, pp. 250–259.

[56] “Hp,” http://www.managementsoftware.hp.com.


[57] H. Liu and L. Yu, “Toward integrating feature selection algorithms for classificationand clustering,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, Apr.2005.

[58] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of PlausibleInference, Morgan Kaufmann Publishers, San Francisco, CA, 1988.

[59] T. Dean, K. Basye, R. Chekaluk, S. Hyun, M. Lejter, and M. Randazza, “Copingwith uncertainty in a control system for navigation and exploration.,” in Proc. 8thNational Conference on Artificial Intelligence, Boston, MA, July 29–Aug. 3, 1990,pp. 1010–1015.

[60] D. Heckerman, “Probabilistic similarity networks,” Tech. Rep., Depts. of ComputerScience and Medicine, Stanford University, 1990.

[61] D. J. Spiegelhalter, R. C. Franklin, and K. Bull, “Assessment criticism andimprovement of imprecise subjective probabilities for a medical expert system,”in Proc. Fifth Workshop on Uncertainty in Artificial Intelligence, 1989, pp. 335–342.

[62] E. Charniak and D. McDermott, Introduction to Artificial Intelligence,Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1985.

[63] T. S. Levitt, J. Mullin, and T. O. Binford, “Model-based influence diagrams formachine vision,” in Proc. 5th Workshop on Uncertainty in Artificial Intelligence,1989, pp. 233–244.

[64] R. E. Neapolitan, Probabilistic reasoning in expert systems: theory and algorithms,John Wiley & Sons, Inc., New York, NY, USA, 1990.

[65] K. Weinberger, J. Blitzer, and L. Saul, “Distance metric learning for large marginnearest neighbor classification,” in Proc. 19th annual Conference on NeuralInformation Processing Systems, Vancouver, CA, Dec. 2005.

[66] R. Kohavi and F. Provost, “Glossary of terms,” Machine Learning, vol. 30, pp.271–274, 1998.

[67] B. Ziebart, D. Roth, R. Campbell, and A. Dey, “Automated and adaptive thresholdsetting: Enabling technology for autonomy and self-management,” in Proc. 2ndInternational Conference of Autonomic Computing, June 13–16, 2005, pp. 204–215.

[68] P. Mitra, C. Murthy, and S. Pal, “Unsupervised feature selection using featuresimilarity,” IEEE Trans. Pat. Anal. Mach. Intel., vol. 24, no. 3, pp. 301–312, Mar.2002.

[69] W. Lee, S. J. Stolfo, and K. W. Mok, “Adaptive intrusion detection: A data miningapproach,” Artificial Intelligence Review, vol. 14, no. 6, pp. 533–567, 2000.

[70] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen, “Performance debugging for distributed systems of black boxes,” in Proc. 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, Oct. 19–22, 2003, pp. 74–89.

[71] R. Isaacs and P. Barham, “Performance analysis in loosely-coupled distributedsystems,” in Proc. 7th CaberNet Radicals Workshop, Bertinoro, Italy, Oct. 2002.

[72] I. Foster, “The anatomy of the grid: enabling scalable virtual organizations,” inProc. 1st IEEE/ACM International Symposium on Cluster Computing and the Grid,2001, pp. 6–7.

[73] R. Wolski, “Dynamically forecasting network performance using the network weatherservice,” in Journal of cluster computing, 1998.

[74] I. Matsuba, H. Suyari, S. Weon, and D. Sato, “Practical chaos time series analysiswith financial applications,” in Proc. 5th International Conference on SignalProcessing, Beijing, 2000, vol. 1, pp. 265–271.

[75] P. Magni and R. Bellazzi, “A stochastic model to assess the variability of bloodglucose time series in diabetic patients self-monitoring,” IEEE Trans. Biomed. Eng.,vol. 53, no. 6, pp. 977–985, 2006.

[76] K. Didan and A. Huete, “Analysis of the global vegetation dynamic metrics usingmodis vegetation index and land cover products,” in IEEE International Geoscienceand Remote Sensing Symposium (IGARSS’04), 2004, vol. 3, pp. 2058–2061.

[77] P. Dinda, “The statistical properties of host load,” Scientific Programming, , no.7:3-4, 1999.

[78] P. Dinda, “Host load prediction using linear models,” Cluster Computing, vol. 3, no.4, 2000.

[79] Y. Zhang, W. Sun, and Y. Inoguchi, “CPU load predictions on the computationalgrid *,” in Proc. 6th IEEE International Symposium on Cluster Computing and theGrid, May 2006, vol. 1, pp. 321–326.

[80] J. Liang, K. Nahrstedt, and Y. Zhou, “Adaptive multi-resource prediction indistributed resource sharing environment,” in Proc. IEEE International Symposiumon Cluster Computing and the Grid, 2004, pp. 293–300.

[81] S. Vazhkudai and J. Schopf, “Predicting sporadic grid data transfers,” Proc.International Symposium on High Performance Distributed Computing, pp. 188–196,2002.

[82] S. Vazhkudai, J. Schopf, and I. Foster, “Using disk throughput data in predictionsof end-to-end grid data transfers,” in Proc. 3rd International Workshop on GridComputing, Nov. 2002.


[83] S. Gunter and H. Bunke, “An evaluation of ensemble methods in handwritten wordrecognition based on feature selection,” in Proc. 17th International Conference onPattern Recognition, Aug. 2004, vol. 1, pp. 388–392.

[84] G. Jain, A. Ginwala, and Y. Aslandogan, “An approach to text classification usingdimensionality reduction and combination of classifiers,” in Proc. IEEE InternationalConference on Information Reuse and Integration, Nov. 2004, pp. 564–569.

[85] V. white paper, “Comparing the mui, virtualcenter, and vmkusage,” .

[86] J. D. Cryer, Time series analysis, Duxbury Press, Boston, MA, 1986.

[87] S. G. John O.Rawlings and D. A.Dickey, Applied Regression Analysis, Springer,2001.

[88] R. T. Trevor Hastie and J. Friedman, The Elements of Statistical Learning, Springer,2001.

[89] E. Bingham and H. Mannila, “Random projection in dimensionality reduction:applications to image and text data,” in Knowledge Discovery and Data Mining,2001, pp. 245–250.

[90] L. Sirovich and R. Everson, “Management and analysis of large scientific datasets,”Int. Journal of Supercomputer Applications, vol. 6, no. 1, pp. 50–68, 1992.

[91] J. Yang, Y. Zhang and B. Kisiel, “A scalability analysis of classifiers in textcategorization,” in ACM SIGIR’03, 2003, pp. 96–103.

[92] F. Friedman, J.H. Baskett and L. Shustek, “An algorithm for finding nearestneighbors,” IEEE Transactions on Computers, vol. C-24, no. 10, pp. 1000–1006, Oct.1975.

[93] J. Friedman, J.H. Bentley and R. Finkel, “An algorithm for finding best matches inlogarithmic expected time,” ACM Transactions on Mathematical Software, vol. 3,pp. 209–226, 1977.

[94] P. D. G. Banga and J. Mogul, “Resource containers: A new facility for resourcemanagement in server systems,” in Proc. 3rd symposium on Operating SystemDesign and Implementation, New Orleans, Feb. 1999.

[95] L. Ramakrishnan, L. Grit, A. Iamnitchi, D. Irwin, A. Yumerefendi, and J. Chase,“Towards a doctrine of containment: Grid hosting with adaptive resource control,”in Proc. Supercomputing, Tampa, FL, Nov. 2006.

[96] R. Dubes, “How many clusters are best? -an experiment,” Pattern Recogn., vol. 20,no. 6, pp. 645–663, Nov. 1987.

[97] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACMComputing Surveys, vol. 31, no. 3, pp. 264–323, 1999.


[98] “Worldcup98,” http://ita.ee.lbl.gov/html/contrib/WorldCup.html.

[99] “Logreplayer,” http://www.cs.virginia.edu/ rz5b/software/logreplayer-manual.htm.

[100] C. Isci, A. Buyuktosunoglu, and M. Martonosi, “Long-term workload phases:duration predictions and applications to dvfs,” IEEE Micro, vol. 25, no. 5, pp.39–51, 2005.

[101] C. Isci and M. Martonosi, “Phase characterization for power: evaluatingcontrol-flow-based and event-counter-based techniques,” Proc. 12th InternationalSymposium on High-Performance Computer Architecture, pp. 121–132, 2006.

[102] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automaticallycharacterizing large scale program behavior,” in Proc. 10th International Con-ference on Architectural Support for Programming Languages and Operating Systems,2002, pp. 45–57.

[103] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi,“Pinpointing representative portions of large intel itanium programs with dynamicinstrumentation,” in Proc. 37th annual international symposium on Microarchitec-ture, 2004.

[104] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas,“Memory hierarchy reconfiguration for energy and performance in general purposearchitectures,” in Proc. 33rd annual international symposium on microarchitecture,Dec. 2000, pp. 245–257.

[105] A. Dhodapkar and J. Smith, “Managing multi-configuration hardware via dynamicworking set analysis,” in Proc. 29th Annual International Symposium on ComputerArchitecture, Anchorage, AK, May 2002, pp. 233–244.

[106] A. Dhodapkar and J. Smith, “Comparing program phase detection techniques,” inProc. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003,pp. 217–227.

[107] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal, “Dynamic provisioningof multi-tier internet applications,” in Proc. 2nd International Conference ofAutonomic Computing, June 2005, pp. 217–228.

[108] J. Wildstrom, P. Stone, E. Witchel, R. J. Mooney, and M. Dahlin, “Towardsself-configuring hardware for distributed computer systems,” in Proc. 2nd Interna-tional Conference of Autonomic Computing, June 2005, pp. 241–249.

[109] J. S. Chase, D. E. Irwin, L. E. Grit, J. D. Moore, and S. E. Sprenkle, “Dynamicvirtual clusters in a grid site manager,” Proc. 12th IEEE International Symposiumon High Performance Distributed Computing, pp. 90–100, June 2003.


[110] D. Kusic and N. Kandasamy, “Risk-aware limited lookahead control for dynamicresource provisioning in enterprise computing systems,” in Proc. 3rd InternationalConference of Autonomic Computing, 2006, pp. 74–83.


BIOGRAPHICAL SKETCH

Jian Zhang was born in Chengdu, China. She received her B.S. degree in 1995, from

the University of Electronic Science and Technology of China, majoring in computer

communication. She received her M.S. degree in 2001 from the University of Florida,

majoring in electrical and computer engineering. Since 2002, she has been with the

Advanced Computing and Information Systems Laboratory (ACIS) at the University of

Florida, pursuing her Ph.D. degree. Her research interests include distributed systems,

autonomic computing, virtualization technologies, and information systems.
