Adaptive Cache Aware Multiprocessor Scheduling
Framework (Research Masters)
A THESIS SUBMITTED TO
THE FACULTY OF SCIENCE AND TECHNOLOGY
OF QUEENSLAND UNIVERSITY OF TECHNOLOGY
IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
RESEARCH MASTER
Huseyin Gokseli Arslan
Faculty of Science and Technology
Queensland University of Technology
September 2011
Copyright in Relation to This Thesis
© Copyright 2011 by Huseyin Gokseli Arslan. All rights reserved.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an
award at this or any other higher education institution. To the best of my knowledge and belief,
the thesis contains no material previously published or written by another person except where
due reference is made.
Signature:
Date:
This thesis is dedicated to my dearest family and my beloved one
for their love and endless support.
Abstract
Computer resource allocation represents a significant challenge, particularly for multiprocessor systems, which consist of shared computing resources to be allocated among co-runner processes and threads. While efficient resource allocation results in a highly efficient and stable overall multiprocessor system and good individual thread performance, poor resource allocation causes significant performance bottlenecks even in systems with abundant computing resources. This thesis proposes a cache aware adaptive closed loop scheduling framework as an efficient resource allocation strategy for the highly dynamic resource management problem, which requires instant estimation of highly uncertain and unpredictable resource patterns.
Many different approaches to this highly dynamic resource allocation problem have been developed, but neither the dynamic nature nor the time-varying and uncertain characteristics of the resource allocation problem are well considered. These approaches employ either static or dynamic optimization methods, or advanced scheduling algorithms such as the Proportional Fair (PFair) scheduling algorithm. Some of these approaches, which consider the dynamic nature of multiprocessor systems, apply only a basic closed loop system; hence, they fail to take the time-varying and uncertain nature of the system into account. Therefore, further research into multiprocessor resource allocation is required.
Our closed loop cache aware adaptive scheduling framework takes resource availability and resource usage patterns into account by measuring time-varying factors such as cache miss counts, stalls and instruction counts. More specifically, the cache usage pattern of a thread is identified using the QR recursive least squares (RLS) algorithm and cache miss count time series statistics. For the identified cache resource dynamics, our closed loop cache aware adaptive scheduling framework enforces instruction fairness for the threads. Fairness in the context of our research project is defined as resource allocation equity, which reduces co-runner thread dependence in a shared resource environment. In this way, instruction count degradation due to shared cache resource conflicts is overcome.
In this respect, our closed loop cache aware adaptive scheduling framework contributes to the research field in two major and three minor aspects. The two major contributions lead to the cache aware scheduling system. The first major contribution is the development of the execution fairness algorithm, which reduces the impact of co-runner cache contention on thread performance. The second contribution is the development of the relevant mathematical models, such as the thread execution pattern and cache access pattern models, which formulate the execution fairness algorithm in terms of mathematical quantities.
Following the development of the cache aware scheduling system, our adaptive self-tuning control framework is constructed to add an adaptive closed loop aspect to the cache aware scheduling system. This control framework consists of two main components: the parameter estimator and the controller design module. The first minor contribution is the development of the parameter estimator; the QR Recursive Least Squares (RLS) algorithm is applied in our closed loop cache aware adaptive scheduling framework to estimate the highly uncertain and time-varying cache resource patterns of threads. The second minor contribution is the design of the controller design module; an algebraic controller design algorithm, pole placement, is utilized to design the relevant controller, which is able to provide the desired time-varying control action. The adaptive self-tuning control framework and the cache aware scheduling system together constitute our final framework, the closed loop cache aware adaptive scheduling framework. The third minor contribution is the validation of this framework's efficiency in overcoming co-runner cache dependency. Time-series statistical counters are developed for the M-Sim Multi-Core Simulator, and the theoretical findings and mathematical formulations are implemented as MATLAB m-files. In this way, the overall framework is tested and the experimental outcomes are analyzed. From these outcomes, it is concluded that our closed loop cache aware adaptive scheduling framework successfully drives the co-runner cache dependent thread instruction count towards the co-runner independent instruction count, with an error margin of up to 25% when the cache is highly utilized. In addition, the thread cache access pattern is estimated with 75% accuracy.
Keywords
Multiprocessor Scheduling, Adaptive Control Theory, Recursive Least Square, Cache-Aware
Adaptive Scheduling Framework
Acknowledgments
I gratefully acknowledge the contributions of my principal supervisor, Assoc. Prof. Glen Tian, my associate supervisor, Dr. Ross Hayward, and Queensland University of Technology.
Table of Contents
Abstract v
Keywords vii
Acknowledgments ix
Nomenclature xv
List of Figures xxiv
List of Tables xxv
1 Introduction 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Scope and Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Literature Review 13
2.1 Multi-Core Chip Level Multiprocessor System Architecture . . . . . . . . . . . 13
2.1.1 Core Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Memory Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Core Diversity and Parallelism . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Cache Architecture and Policies . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Cache Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Cache Performance Indicator: Cache Miss . . . . . . . . . . . . . . . 21
2.3 Multiprocessor Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Multiprocessor Scheduling Taxonomy . . . . . . . . . . . . . . . . . . 23
2.3.2 Real-Time Multi-Core Multiprocessor Scheduling Algorithms . . . . . 25
2.4 Cache-Aware Multi-Core Chip Level Multiprocessor Scheduling . . . . . . . . 27
2.4.1 Cache-Fair Multi-Core CMP Scheduling . . . . . . . . . . . . . . . . 27
2.4.2 Adaptive Cache-Aware CMP Scheduling . . . . . . . . . . . . . . . . 40
2.5 Modern Control Theory for Scheduling Problems . . . . . . . . . . . . . . . . 47
2.5.1 Adaptive Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.2 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5.3 Control System Design Algorithms . . . . . . . . . . . . . . . . . . . 53
3 System Model Adaptive Control 57
3.1 Theoretical Background of Dynamic System Model . . . . . . . . . . . . . . . 57
3.1.1 State-Space Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.1.2 Adaptive Control Theory . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Development of Thread Execution Pattern Model . . . . . . . . . . . . . . . . 64
3.3 Control Framework Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4 Adaptive Control Framework Development . . . . . . . . . . . . . . . . . . . 71
3.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Parameter Estimation in Adaptive Control 75
4.1 Theoretical Background of Parameter Estimation . . . . . . . . . . . . . . . . 75
4.2 Least Square Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Adaptive Weighted QR Recursive Least Square Algorithms . . . . . . . . . . . 79
4.3.1 Theoretical Background . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 Formulation and Theoretical Conclusions . . . . . . . . . . . . . . . . 81
4.3.3 Complexity Analysis of QR-RLS Algorithm . . . . . . . . . . . . . . 92
4.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5 Algebraic Controller Design Methods 93
5.1 Introduction and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Theoretical Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Deadbeat Controller Design . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.2 Pole Placement Controller Design . . . . . . . . . . . . . . . . . . . . 102
5.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Experimental Setup and Simulation 113
6.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2 Experiment Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2.1 Development of Experiment Constraints . . . . . . . . . . . . . . . . . 115
6.2.2 Design Strategy: Two-Stage Experiment . . . . . . . . . . . . . . . 117
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.3.2 Experiment Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3.3 Analysis and Evaluation of Experiments . . . . . . . . . . . . . . . . . 122
6.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7 Conclusions and Recommendations 137
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 Future Work and Recommendations . . . . . . . . . . . . . . . . . . . . . . . 138
7.2.1 Heterogeneous Multiprocessor Architecture Resource Allocation Problem 138
7.2.2 Statistical Pre-processing of Real-Time Statistical Information . . . . . 139
7.2.3 Robust Adaptive Control Theory . . . . . . . . . . . . . . . . . . . . . 139
7.2.4 Theoretical Analysis of the Scheduling Framework . . . . . . . . . . . 140
Bibliography 141
Nomenclature
Abbreviations
ALU Arithmetic Logic Unit
AppC Application Controller
APPC Adaptive Pole Placement
ATD Auxiliary Tag Directory
CMP Chip Level Multiprocessor
CMR Co-runner Miss Rate
CP Cache Miss Penalty
CPI Cycles Per Instruction
BQ Bus Queue
BQD Buffer Queue Delay
DARMA Dynamic Auto Regressive Moving Average
DBQ Data Bus Queue
DM Deadline-Monotonic (Scheduling)
DMHA Dynamic Miss Handling Architecture
EDF Earliest-Deadline-First (Scheduling)
ER-PF Early Release Proportional Fair (Scheduling)
ER-PD Early Release Predictive Deadline (Scheduling)
FAMHA Fair Adaptive Miss Handling Architecture
FCP Finite Cache Penalty
FIFO First In, First Out
FP Fixed Priority
FS Fair Speedup
FSI Fair Speedup Improvement
IC Instruction Count
ILP Instruction Level Parallelism
IPC Inter Processor Communication
IPs Interference Points
LFU Least Frequently Used (Scheduling)
LLF Least Laxity First (Scheduling)
LRU Least Recently Used (Scheduling)
LS Least Square
LTI Linear Time Invariant
LTV Linear Time Varying
MAD Memory Access Delay
MHA Miss Handling Architecture
MIMO Multiple Input Multiple Output
MLP Memory Level Parallelism
MSHR Miss Status Holding Register
MR Miss Rate
MRC Miss Rate Curves
MRU Most Recently Used
NUMA Non-Uniform Memory Access
PAN Pre-Actuation Negotiator
PD Pseudo Deadline (Scheduling)
PF Proportional Fair (Scheduling)
Pfair Proportional Fair (Scheduling)
PI Proportional-Integral (Control/Controller)
PID Proportional-Integral-Derivative (Control/Controller)
RBQ Request Bus Queue
RLS Recursive Least Square
RM Rate-Monotonic (Scheduling)
SD Service Differentiation
SHARP Shared Resource Partitioning
SIMD Single Instruction Multiple Data
SISO Single Input Single Output
SMP Symmetric Multiprocessor
SMT Simultaneous Multithreading
SPM Static Parametric Model
UMA Uniform Memory Access
VLIW Very Long Instruction Word
VT Vertical Threading
Symbols Chapter
A,B,C System State Space Model Coefficient Matrices Ch3
A System Plant Denominator Polynomial Ch5
ai ith coefficient term of A Ch5
Ac Desired Closed Loop Characteristic Polynomial Ch5
Ay Maximum Overshoot (Closed Loop Characteristic) Ch5
B System Plant Numerator Polynomial Ch5
bi ith coefficient term of B Ch5
C Controller Ch3
CMC Co-Runner Miss Count Vector Ch3,5
cmc Co-Runner Miss Count Ch3,5
DGDes Desired Closed Loop Denominator Polynomial Ch5
DIC ICCacheDedicated Denominator Polynomial Ch5
E Error Tracking Function Ch5
fe(x, θ) Prediction Error Distribution Ch4
ICCacheDedicated Cache Dedicated Closed Loop System Ch5
ICError Instruction Count Error Ch3
G System Plant Transfer Function Ch3
GDesired Desired Closed Loop Transfer Function Ch5
G.M Gain Margin Ch5
GP Guaranteed Percentage Ch2
linf Infinity Norm Ch3
l2 Euclidean Norm Ch3
L(q) Linear Filter Ch4
M∗ System Model Set Ch4
M(θ) System Model Set Element Ch4
M Fairness Metric Ch2
m Normalization Signal Ch3
MC Miss Count Vector Ch3
mc Miss Count Ch3
mc(k) Miss Count Estimate Ch4
mc(k) Miss Count Estimate Vector Ch4
mcq(k) Triangularized Miss Count Vector Ch4
MPR Miss Prediction Rate Ch2
Mperf Performance Fairness Metric Ch2
MMiss Cache Fairness Metric Ch2
NGDes Desired Closed Loop Numerator Polynomial Ch5
NIC ICCacheDedicated Numerator Polynomial Ch5
Q(k) Givens Rotation Matrix Ch4
Qθi(k) Givens Rotation Matrices with Rotation Angle θi Ch4
Q Controller Transfer Function Numerator Polynomial Ch5
qi ith coefficient term of Q Ch5
P Controller Transfer Function Denominator Polynomial Ch5
pi ith coefficient term of P Ch5
P outi Performance Metric Ch2
P refi Performance Reference Metric Ch2
R Autocorrelation Matrix Ch4
si Continuous Time Roots of Polynomial Ch5
Tded Execution Time Dedicated Cache Environment Ch2
Tovl Overlap Operation Cycles Ch2
To Sampling Period Ch5
Tpri Private Operation Cycles Ch2
Tshr Execution Time Shared Cache Environment Ch2
Tvul Vulnerable Operation Cycles Ch2
tr trace Ch4
u Control Input Ch3, Ch5
uc Input Command (Closed Loop) Ch5
U(k) Triangularized Input Data Matrix Ch4
V [k] Random Noise Component Ch3
VN(θ, ZN) Norm or Criterion Cost Function Ch4
wi Requested Cache Ways Ch2
W Available Cache Ways Ch2
W Parameter Weight Vector Ch4
W [k] Random Noise Component Ch3
wn Normalized Frequency Ch5
w(k) Plant Parameter Vector (QR RLS Algorithm) Ch4
w Plant Parameter Coefficients Ch4
x State Space Variable Ch2,3
x State Space Vector Ch2,3
x State Space Variable Estimate Ch2,3
y System Output Ch3,4,5,6
y(t|θ) Prediction Ch4
zi Discrete Time Polynomial Roots Ch5
δCPUCY CLE Additional CPU Cycles Ch3
‖·‖∞ Infinity Norm Ch3
Greek Letters
ϕi(∞) Ideal Instruction Per Cycle (wi=∞) Ch2
ψi(t) Predicted Number of Cache Ways Ch2
Θ Weighting Factor Ch2
θ(t) Plant Parameter Vector (Adaptive Control) Ch3
θ(t)∗ Unknown Plant Parameter Vector (Adaptive Control) Ch3
θi Givens Rotation Angles Ch4
ϕ Regression Input (Regressor) Ch3,4
ϕ(k) Input Regression Vector Ch4
ψ Input Data Matrix Ch4
φ Regression Vector Ch3
Γ Adaptive Gain Ch3
ε(t, θ∗) Model Prediction Error Ch4
ε error vector (Least Square) Ch4
ε Posterior Error Vector Ch4
ξd(k) RLS Cost Function Ch4
ζ Damping Ratio Ch5
List of Figures
2.1 Instruction Issue and Cache Miss for a Single Threaded Processor and a 2-Threaded Processor Supporting SMT [Thimmannagari, 2008] . . . . . . . . . . 15
2.2 4-Way Set Associative Cache and Main Memory . . . . . . . . . . . . . . . . 20
2.3 Thread CPU Latency vs Cache Allocation for Different Scenarios [Fedorova
et al., 2006] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 SHARP Control Architecture [Srikantaiah et al., 2009] . . . . . . . . . . . . . 43
2.5 Block Diagram Adaptive System [Astrom and Wittenmark, 1994] . . . . . . . 48
3.1 Indirect (Explicit) Adaptive Control [Ioannou and Fidan, 2006] . . . . . . . . 63
3.2 Direct (Implicit) Adaptive Control [Ioannou and Fidan, 2006] . . . . . . . . . 64
3.3 Thread Execution Pattern Model . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4 Closed-Loop CMP Control System Model . . . . . . . . . . . . . . . . . . . . 70
5.1 A General Linear Controller with Two Degrees of Freedom [Astrom and Wit-
tenmark, 1994] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Step Response of Closed Loop Transfer Function . . . . . . . . . . . . . . . . 106
5.3 Closed Loop Stability Analysis Bode Plots & Root Locus . . . . . . . . . . . . 107
6.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 MATLAB Framework Simulation Flow . . . . . . . . . . . . . . . . . . . . . 120
6.3 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 123
6.4 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 123
6.5 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 124
6.6 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 125
6.7 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 126
6.8 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 126
6.9 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 127
6.10 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 128
6.11 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 129
6.12 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 129
6.13 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 130
6.14 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 131
6.15 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 132
6.16 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 132
6.17 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 133
6.18 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 134
List of Tables
3.1 System Type vs State Space Model . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1 Multi-Core CMP Architecture Constraints . . . . . . . . . . . . . . . . . . . . 119
6.2 Workload/Thread Definition & Scope [KleinOsowski and Lilja, 2002] . . . . . 121
6.3 Adaptive Scheduling Framework Experiments Results . . . . . . . . . . . . . 121
Chapter 1
Introduction
This thesis proposes a cache-aware adaptive closed loop scheduling framework to address the shared computing resource management problem of threads. Despite the significant technological advancement in multiprocessor architectural and computational capabilities, it is still a challenge to predict the computational resource requirements of threads. This lack of knowledge of resource demands causes inefficiency in the overall performance of the multi-core chip level multiprocessor architecture. In this context, our real-time closed loop scheduling framework is designed to increase processor resource allocation efficiency on multi-core multiprocessing platforms. In other words, accurate cache resource estimation and efficient processor cycle allocation are the main goals of our research project. To achieve these goals, fair resource allocation is enforced on each core: when co-runner threads make unequal cache allocation requests, fairness is imposed by adjusting the CPU cycles granted to each thread within a single CPU quantum. In our project context, fairness is the proportional allocation of the different computing resources belonging to the same thread. For instance, for a highly active thread with a relatively high instruction count allocation, fair allocation imposes a proportional cache memory allocation, so that a thread currently allocated many processor cycles will not be stalled for lack of the complementary resource, cache memory. As a result, each thread will execute as many instructions per cache allocation as it would under ideal circumstances, in which a dedicated cache is allocated [Fedorova et al., 2006]. In this context, the metric indicating unfair cache resource allocation is the difference between the instruction count executed in the dedicated cache environment and the instruction count achieved in the shared cache environment. In fact, this difference is equivalent to the cache stall caused by unfair cache allocation. In this case,
our framework allocates more processor cycles to that particular thread to drive it towards the instruction count of the dedicated cache case.
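The unfairness metric and the resulting cycle adjustment described above can be sketched as follows. The function names and the simple proportional adjustment rule are illustrative assumptions, not the thesis's exact formulation:

```python
def cache_stall_error(ic_dedicated, ic_shared):
    """Unfairness metric: instruction-count gap between the dedicated-cache
    and shared-cache environments (equivalent to the unfair cache stall)."""
    return ic_dedicated - ic_shared

def adjusted_cycles(base_cycles, ic_dedicated, ic_shared):
    """Grant a thread extra processor cycles in proportion to its
    instruction-count degradation (illustrative proportional rule)."""
    if ic_dedicated == 0:
        return base_cycles
    degradation = cache_stall_error(ic_dedicated, ic_shared) / ic_dedicated
    return int(base_cycles * (1 + degradation))

# A thread that executed 800 instructions under cache sharing, but would
# have executed 1000 with a dedicated cache, receives 20% more cycles.
print(adjusted_cycles(10000, 1000, 800))  # -> 12000
```

A thread that suffers no degradation keeps its base allocation, so the adjustment vanishes exactly when the fairness metric is zero.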
Our cache-aware adaptive closed loop scheduling framework addresses the thread allocation and corresponding cache sharing problems by employing an adaptive self-tuning regulator control system architecture with a parameter estimation algorithm, QR Recursive Least Squares (RLS). In other words, our cache aware adaptive closed loop scheduling framework can be deconstructed into the cache aware scheduling system, which formulates the algorithmic solution to the thread allocation and cache sharing problem, and the adaptive self-tuning control framework, which ensures the consistency of the solution for the time-varying (adaptive) thread allocation and cache sharing problem. In our adaptive self-tuning regulator architecture, the cache-aware adaptive closed loop scheduling framework updates the controller parameters in line with any changes in the operating environment, which in our case is the shared cache allocation of each co-runner thread. As a result, our research contributes an innovative closed loop scheduling framework which guarantees stable and efficient scheduling of multiple threads on a multicore chip multiprocessing platform with a shared L2 cache.
1.1 Background and Motivation
Optimal and efficient resource management of limited computing resources has been a signif-
icant research problem, which has been addressed with many different architectural and algo-
rithmic approaches. Emerging application parallelism in line with the architectural parallelism
has improved the computing resource availability. In fact, this is achieved by the architectural
innovation encouraging parallel subtasks (threads) to utilize a pool of computing resources rather than dedicated limited resources. However, this architectural revolution has also brought a new research problem: the effective management of these shared resource pools. As a result, there
has been significant research effort on computing resource allocation efficiency problems in
multiprocessor architecture supporting thread parallelism at both software and hardware level.
In our thesis, our effort is focused on the processor cycle allocation on a multi-core chip level
multiprocessor with shared L2 cache resources.
In general, existing resource management/allocation (scheduling) approaches can be clas-
sified into cache-aware scheduling algorithms and traditional scheduling algorithms. In the
traditional thread scheduling there are two main factors involved: priority and affinity. Priority is determined through inheritance; for instance, a thread or task can inherit its priority from the thread (task) that spawned it. Affinity assigns threads/tasks to a subset of processors [Villani, 2001] based on historical allocation, considering the fact that remnants of a thread/task may remain in the cache of a core from the last thread/task execution. Hence, assigning these tasks/threads to the same core results in more efficient processor utilization than allocating them to a different processor [Corporation, 2003]. In multiprocessing environments, among the most widely used algorithms are PFair-class algorithms, which address efficiency loopholes of uniprocessor scheduling algorithms such as Earliest Deadline First (EDF) and Rate Monotonic (RM).
Cache-aware scheduling algorithms address the inability of processors to analyze the memory access behaviors of cache units. Cache-aware scheduling can be classified into two groups: cache-fair scheduling and cache pattern-aware scheduling (placement). Cache-fair thread scheduling algorithms take advantage of the fact that fair sharing of resources among threads will minimize resource conflicts and bottlenecks. Cache pattern-aware scheduling (placement) algorithms are based on partitioning scheduling algorithms, in which tasks are statically scheduled among the available cores based on the scheduling criteria [Zhou et al., 2009a]. That is, they analyze the data patterns of the threads at the application level and place the threads onto the most convenient cores according to their cache data access behaviors.
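As a minimal illustration of the placement idea, the sketch below statically assigns threads to cores so that estimated per-core cache demand is balanced. The greedy heuristic and the demand figures are assumptions for illustration, not a specific algorithm from the literature:

```python
def place_threads(cache_demand, num_cores):
    """Greedy static placement: assign each thread (heaviest first) to the
    core with the least accumulated cache demand."""
    load = [0.0] * num_cores
    placement = {}
    for tid in sorted(cache_demand, key=cache_demand.get, reverse=True):
        core = min(range(num_cores), key=lambda c: load[c])
        placement[tid] = core
        load[core] += cache_demand[tid]
    return placement

# Four hypothetical threads with differing cache footprints on two cores.
demands = {"t0": 8.0, "t1": 6.0, "t2": 3.0, "t3": 2.0}
print(place_threads(demands, 2))  # -> {'t0': 0, 't1': 1, 't2': 1, 't3': 0}
```

Real cache pattern-aware schedulers use richer criteria (data reuse, sharing between threads) than a single scalar demand, but the partitioning structure is the same: a static decision made before execution.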
Simulation results, as well as the scope of the existing cache-aware multi-core scheduling algorithms, indicate that these algorithms are generally static open-loop algorithms which cannot dynamically adapt to processor state transitions such as the varying cache demands of threads [Fedorova et al., 2006; Zhou et al., 2009b; Ebrahimi et al., 2010; Kim et al., 2004; Jahre and Natvig, 2009]. Hence, the efficiency of such an algorithm is only verifiable at a specific system state, and the algorithm's performance degrades to the extent that the system state or operating point moves away from the expected value. As a result, closed loop algorithms are needed to replace traditional open loop algorithms in operating environments where system states and parameters are subject to significant variation. In the context of multi-core chip level multiprocessor scheduling, cache access patterns can also be identified as highly unpredictable and uncertain due to the highly correlated nature of co-runner threads. The simulation results of Tam et al. [Tam et al., 2009] on a number of applications indicate that the cache miss rate curves may show great differences between applications
or threads; thus, it is very hard to predict the thread data reference pattern, as well as the cache miss rate, in such highly dynamic environments. Hence, a dynamic closed loop framework is required to adapt to varying L2 cache access patterns.
There have been previous attempts at using a closed feedback framework for cache allocation on multi-core chip architecture platforms, but to the best of our knowledge, there is no research work implementing adaptive self-tuning control theory for such a problem. Srikantaiah et al. [Srikantaiah et al., 2009] use formal feedback control theory for dynamically partitioning the multiprocessor shared last-level caches; this is achieved by optimizing last-level cache utilization among multiple concurrently executing applications on a well defined service level platform. Srikantaiah et al. [Srikantaiah et al., 2009] also explain the advantage of using formal feedback as a theoretical guarantee of maximizing the utilization of the cache space in a fair manner.
The actual challenge in such environments is the fact that a CPU clock cycle (quantum) is a
very small period. In addition, superscalar core architecture, multithreading at processor level
and cache replacement algorithms continuously change cache and processor dynamics on each
clock period/cycle. In this respect, the multi-core multiprocessor efficiency and utilization can
be subject to a significant worst-case system destabilization. As a result, advanced closed loop
tools for a multi-core scheduling algorithm are proposed to track these multi-core chip level
multiprocessor cache behaviors.
In this regard, cache behavior can be classified into steady state and transient cache behaviors. Steady state cache behavior refers to the cache behavior after the initial cache memory blocks are allocated to the process/thread and the requested initial data is successfully placed in the corresponding blocks of the on-chip cache units. In connection with steady-state cache behavior, a significant observation from the miss rate curves [Tam et al., 2009] is that, after the initial cache allocation for an application, the miss rate curve (MRC) of the application approaches a specific cache miss value. In contrast, transient cache behavior refers to the cache behavior during the unsettled period from the initial memory access request to the placement of the initial data in the cache memory. Transient cache behavior, which indicates how fast miss rate curves approach their limits, shows a strong dependence on the application's data coherence. For instance, if an application utilizes streaming data, the cache content needs to be updated continuously;
hence, the miss rate curve (MRC) for this type of application will have a versatile and non-uniform characteristic. In contrast, some applications have high data coherence, so these applications regularly reuse the existing data in the cache. In this case, the miss rate curve is expected to be uniform around a limit after a short transient state (settling period).
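The settling behaviour described above can be illustrated numerically. Assuming a hypothetical MRC that decays exponentially toward its steady-state limit (a stand-in for measured data, not data from the thesis), the sketch below finds the sample at which the curve enters, and stays within, a tolerance band around that limit:

```python
def settling_index(mrc, limit, tol=0.02):
    """Return the first index after which every sample of the miss-rate
    curve stays within `tol` of its steady-state limit (None if never)."""
    for i in range(len(mrc)):
        if all(abs(m - limit) <= tol for m in mrc[i:]):
            return i
    return None

# Hypothetical high-coherence application: MRC settles quickly toward 0.1.
mrc = [0.1 + 0.5 * (0.5 ** k) for k in range(12)]
print(settling_index(mrc, 0.1))  # -> 5
```

A streaming application would correspond to an `mrc` sequence that keeps fluctuating, for which `settling_index` returns a late index or `None`.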
In such operating environments, different advanced control frameworks are applicable depending on the system model and characteristics. Green and Limebeer [Green and Limebeer, 1995] emphasize the insufficiency of the classical control approach when the plant dynamics are complex and include a significant amount of uncertainty. As a result, advanced control tools such as robust optimal control and adaptive self-tuning control methods are put forward. In our research, two alternative frameworks are investigated: a robust control framework and an adaptive self-tuning control framework. One of the most common robust control methods, H∞ optimal control, was developed in response to modeling errors and uncertainty; its basic philosophy is to optimize worst-case errors [Green and Limebeer, 1995]. In this approach, however, the system dynamics are assumed known and modeled within a predicted error/uncertainty range (worst-case errors).
In contrast to robust control, an adaptive self-tuning control framework provides a higher degree of flexibility in the plant model: it treats the parameters of the plant model as time-varying entities and estimates these parameters from input and output measurement data. Ioannou and Fidan [Ioannou and Fidan, 2006] state that adaptive self-tuning control is a very powerful tool for stabilizing systems whose plant parameters have no known bounds, as well as for estimating disturbances online and canceling their effect via feedback. To some extent, a robust control framework can also address these challenges; however, an upper bound must then be specified by the designer. In contrast, the adaptive self-tuning control framework includes online learning capabilities about the plant dynamics, so the designer need not define any bound or restriction [Ioannou and Fidan, 2006]. However, online learning can track only slowly varying parameters.
The limitations and benefits of these two approaches are discussed in further detail in the literature review chapter.
With regard to the adaptive self-tuning (closed loop) control framework, the most critical challenge is modeling the execution pattern of the thread, including instruction dynamics and cache patterns, on a multi-core multiprocessor platform. In this case, the thread execution pattern can be represented in terms of two dynamics: instruction count dynamics and cache access pattern dynamics. Here, the instruction count dynamics can be considered the overall closed-loop system response; in other words, the actual execution pattern. The L2 cache access pattern model of the thread, on the other hand, can be considered a plant forming part of the overall dynamics. In this respect, the L2 cache access pattern is in fact an unknown constraint on our overall closed-loop system. In developing L2 cache access pattern models, existing cache replacement policies and cache structure are considered. The paper of Suh et al. [Suh et al., 2001], which proposes an analytical cache model for time-shared systems with a fully associative cache, is one of the contributing references at this stage.
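As a minimal illustration of these two coupled dynamics (and explicitly not the model developed in this thesis), the instruction count can be sketched as a discrete-time state driven by the allocated CPU quantum and perturbed by the unknown miss count; the coefficients `IPC_PEAK` and `STALL_PENALTY` are hypothetical:

```python
# Illustrative sketch of instruction-count dynamics; hypothetical
# coefficients, not the thesis's actual state-space model.

IPC_PEAK = 2.0       # instructions retired per cycle with no misses (assumed)
STALL_PENALTY = 100  # cycles lost per L2 miss (assumed)

def step_instruction_count(i_k, quantum_cycles, miss_count):
    """One sample step: i[k+1] = i[k] + IPC_PEAK * max(0, quantum - penalty * misses).
    The miss count acts as the unknown constraint on the closed loop."""
    effective_cycles = max(0, quantum_cycles - STALL_PENALTY * miss_count)
    return i_k + IPC_PEAK * effective_cycles

# Five 500-cycle quanta with one L2 miss each: misses visibly slow progress.
i = 0.0
for _ in range(5):
    i = step_instruction_count(i, quantum_cycles=500, miss_count=1)
print(i)  # 4000.0 instructions retired, instead of the miss-free 5000.0
```

The point of the sketch is only that the instruction count (the closed-loop output) depends on both the controlled input (the quantum) and the cache access pattern (the misses), which is exactly the coupling the adaptive framework must estimate.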
The implementation of any advanced control framework as a solution in our research also requires background knowledge in system theory, statistical estimation techniques, parameter estimation, and basic linear and complex algebra. A brief discussion of these topics is given in the literature review chapter.
1.2 Research Problem
As stated in the previous section, efficient resource allocation, especially in a highly complex processor architecture such as a multi-core chip-level multiprocessor with shared L2 cache memory, is one of the notable challenges for multiprocessor researchers. In general, research has focused on execution slot/processor cycle allocation; in other words, on time-centric scheduling problems. In this thesis, the co-runner cache dependency of the time-centric scheduling of threads has been investigated.
In fact, the main motivation for investigating specifically the shared cache impact on the time-centric scheduling framework is the performance advantage of shared caches, particularly for chip-level multiprocessor architectures with software parallelization, such as multithreading, and hardware parallelization, such as the superscalar core architecture. Namely, a shared L2 cache maximizes cache utilization when a core's execution resources are underutilized, and minimizes data duplication when co-runner threads share the same working data set [Siddha et al., 2007]. The feasibility and advantages of the shared L2 cache design on the IBM S/390 G4 server platform were also verified by empirical performance data obtained in a series of experiments; in particular, a significant improvement in cache hits over a dedicated cache scheme was observed [Mak et al., 1997]. However, shared cache memory management becomes an additional factor affecting a thread's execution performance and the overall performance of the multi-core chip-level multiprocessor. In this case,
there emerges an obvious dilemma between dedicated and shared cache deployment for better overall system performance. In most cases, heterogeneous data access patterns among memory-intensive threads are inevitable, and inefficient cache management under these access patterns results in cache contention, cache misses, and suboptimal performance. A cache miss refers to an unsuccessful attempt by a core to read or write a particular word in the cache memory. As a result, the current write or read operation is delayed until the memory word is retrieved from its main memory location. In other words, a cache miss degrades the overall performance of a multi-core processor in proportion to the memory stall it causes. By definition, cache contention is caused by false sharing: when multiple cores try to update the same cache line, each must become the exclusive owner of the line in turn, which slows down execution [SiliconGraphicsLibrary, 2000]. The impact of contention on performance depends on the thread's data access pattern, the resources shared, and the number of active threads. All in all, shared cache complexity fundamentally changes the time-centric scheduling picture: a fair amount of CPU time allocated to each thread does not necessarily translate into optimal and fair usage of shared resources; as a result, the overall utilization and performance of a multi-core chip multiprocessor might be degraded by cache-related problems such as cache
contention and misses. In line with the main research problem stated above, it is possible to derive more specific research questions directly related to the proposed framework, which addresses the main research problem. In this thesis, our effort is to address this main research problem with a modern control framework; hence, it is also necessary to model the cache-aware time-centric scheduling problem mathematically. Our model considers the execution and cache access patterns of the thread. Due to the amount of uncertainty involved in our model, particularly in the cache access pattern, the adaptive self-tuning control framework is preferred. As a result, the main research problem of investigating co-runner cache dependency in time-centric scheduling is decomposed into more specific technical questions, which helped us to clarify issues and address challenges related to our cache-aware adaptive closed loop scheduling framework. The research questions that emerged throughout our research effort can be summarized into three main categories:
1. Thread Resource Allocation Strategy
(a) How can thread performance bottleneck, which results from shared L2 cache re-
source conflict among threads, be overcome?
(b) What would be the relation between thread execution pattern and cache access
pattern of threads?
2. Adaptive Self-Tuning Control Framework Related Research Questions
(a) Which adaptive self-tuning control strategy is applicable to the thread execution
resource management problem in multi-core multiprocessor architecture in line with
our research goals and scope?
(b) How can a classical adaptive self-tuning control framework be integrated into our
research problem and our closed loop dynamic system model?
(c) Closed Loop Dynamic System Model Related Research Questions
i. How can the L2 cache access pattern and its impact on thread execution performance be formulated, including the selection of optimization criteria, performance metrics, and controlled inputs, in line with our research scope and goals?
ii. Which mathematical strategies can be helpful in the formulation of cache mem-
ory metrics such as cache miss counts and instruction counts, and control inputs
such as CPU quantum?
(d) System Identification Related Research Questions
i. Which dynamic system model parameters are subject to the system identifica-
tion as a part of the adaptive self-tuning control framework?
ii. Which system identification strategies are applicable to the existing adaptive self-tuning control framework?
(e) Algebraic Controller Design Strategy Related Research Questions
i. Which algebraic controller design method is suitable for our adaptive frame-
work?
ii. What would be the desired adaptive closed loop characteristics?
3. Simulation and Experiment-Related Research Questions
(a) What would be the statistical metrics required by the adaptive scheduling frame-
work?
(b) Which experiment/test cases would be more informative about the success of the
system?
(c) What would be the success criteria for the cache-aware adaptive scheduling frame-
work?
The innovative scheduling control framework for the multi-core chip-level multiprocessor computing environment, which challenges existing scheduling frameworks in terms of performance and feasibility, is guided by the questions above and the responses to them.
1.3 Scope and Limitation
Due to the time frame of our research project and the complexity of processor architecture models, the mathematical model of the thread execution pattern considers only specific L2 cache patterns with a limited number of factors. This significantly simplifies the thread execution model.
The abstraction is achieved by not including specific chip-level multiprocessor features as direct constraints in the model, such as simultaneous multithreading (SMT) and L2 cache memory features such as associative cache ways. In fact, the parameter estimation algorithms operating on this more abstract model still consider the system as a whole, with all its features, since estimation is based on the actual input and output data. Hence, modeling the system at a more abstract level does not cause any loss of accuracy.
Although our framework is applicable to any computing application domain, a significant performance improvement over traditional scheduling approaches can only be achieved for applications that exhibit weak temporal cache locality and high cache requirements. For threads with low cache space requirements and strong temporal cache locality, cache performance does not have a significant impact on overall processor performance. In addition, when a co-runner thread has no significant cache demand affecting the cache performance of the target thread, the framework is expected to act as a passive framework.
As for limitations in parameter estimation performance, the accuracy of the parameter estimation also depends on the characteristics of the input and output data; that is, on the cache behavior of the threads. For instance, input measurements lacking coherence result in inaccurate estimates. Hence, an error margin dependent on the cache pattern of the threads is inevitable.
1.4 Contributions
The primary deliverable of the thesis is the adaptive cache-aware closed loop scheduling framework. The adaptive scheduling framework can be considered a real-time adaptive set of algorithms that applies an instruction execution fairness algorithm to threads exhibiting highly unpredictable cache and execution behavior and unanticipated state changes. In this context, the adaptive cache-aware scheduling framework includes two major and three minor contributions: the execution fairness algorithm, the thread execution pattern model, the parameter estimation algorithms, the controller design algorithm, and the design of additional counters for simulation purposes.
Execution Thread Fairness Algorithm In order to reduce a thread's dependency on the resource demands of its co-runner threads, an execution fairness algorithm is developed. This algorithm takes the number of instructions executed in a dedicated resource environment, particularly with dedicated cache resources, as the reference instruction count, and the number of instructions executed in a shared cache environment as the actual instruction count. In this context, the execution fairness algorithm forces the actual instruction count to converge to the reference instruction count by allocating extra execution slots. In this scenario, the cache behavior determines the impact of an allocated execution slot on the actual instruction count.
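The core of the fairness idea can be sketched as a one-shot calculation (the numbers are hypothetical; the thesis realizes this inside a closed loop, where the cache behavior determines how much each extra slot actually contributes):

```python
def fairness_extra_quanta(reference_count, actual_count, instructions_per_quantum):
    """Extra execution slots needed for the actual (shared-cache) instruction
    count to catch up with the dedicated-cache reference count.
    Sketch only: a one-shot calculation, not the thesis's closed-loop algorithm."""
    deficit = reference_count - actual_count
    if deficit <= 0:
        return 0            # thread is not behind its reference
    # ceiling division: a partial quantum still costs a whole slot
    return -(-deficit // instructions_per_quantum)

# Hypothetical numbers: 10,000 reference vs 7,000 actual instructions,
# roughly 800 instructions retired per extra quantum.
print(fairness_extra_quanta(10_000, 7_000, 800))  # -> 4
```

In the actual framework the per-quantum yield is itself uncertain and time varying, which is why the allocation is driven by feedback rather than by a fixed divisor like `instructions_per_quantum`.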
Thread Execution Pattern Model To achieve a real-time adaptive system response to time-varying resource demands, a thread execution pattern model is developed. This model considers the instruction count and cache resource pattern of the thread, and provides state space equations that allow us to construct a time-series regression model for our parameter estimation algorithms. This model can be considered an innovation due to the additional thread-level granularity and cache awareness included in it.
Parameter Estimation Algorithm A QR Recursive Least Squares (QR-RLS) algorithm is utilized in our real-time adaptive cache-aware scheduling framework. The QR-RLS algorithm estimates cache resource patterns using time-series regression statistics of miss counts and co-runner miss counts. The actual contribution is the development of a regression model based on the time-series miss count statistics. Applying a real-time deterministic estimation algorithm can also be considered an innovative approach, since conventional scheduling algorithms and resource allocation frameworks generally prefer probabilistic strategies to predict the state of the system. As a result, in contrast to probabilistic approaches, our framework is able to track faster pattern transitions.
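The recursion itself can be sketched with the standard (non-QR) RLS update; the QR form computes the same estimate via an orthogonal factorization for better numerical behavior. The regressor, the "true" coefficients, and the forgetting factor below are purely illustrative:

```python
import numpy as np

def rls_update(theta, P, x, y, lam=0.99):
    """One recursive least squares step with forgetting factor lam:
    update parameter estimate theta and covariance P from regressor x
    and measurement y. (Standard RLS shown for brevity; the thesis uses
    the numerically more robust QR-RLS form.)"""
    x = x.reshape(-1, 1)
    gain = P @ x / (lam + float(x.T @ P @ x))
    error = y - float(x.T @ theta)        # one-step prediction error
    theta = theta + gain * error
    P = (P - gain @ x.T @ P) / lam
    return theta, P

# Recover a hypothetical linear miss-pattern relation y = 0.6*u1 - 0.3*u2.
rng = np.random.default_rng(0)
theta = np.zeros((2, 1))
P = np.eye(2) * 1000.0          # large initial covariance: no prior knowledge
for _ in range(200):
    x = rng.normal(size=2)
    y = 0.6 * x[0] - 0.3 * x[1]
    theta, P = rls_update(theta, P, x, y)
print(theta.ravel())            # close to [0.6, -0.3]
```

The forgetting factor `lam` is what lets the estimator track time-varying (but slowly varying) coefficients, at the cost of higher estimate variance.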
Controller Design Algorithm In order to track the reference instruction count irrespective of execution patterns and cache behavior, an algebraic controller design algorithm that can redesign the controller based on the system state is employed. In particular, a pole placement algorithm is selected for this purpose. The main contribution is the formulation of the controller parameters in terms of the execution pattern plant coefficients, which are in fact estimated parameters.
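For a first-order plant (much simpler than the thesis's model), the pole placement calculation reduces to a single gain formula, sketched here with hypothetical estimated coefficients:

```python
def pole_placement_gain(a, b, desired_pole):
    """For the first-order plant y[k+1] = a*y[k] + b*u[k] under proportional
    control u[k] = K*(r - y[k]), the closed loop is
    y[k+1] = (a - b*K)*y[k] + b*K*r, so placing the pole at p gives
    K = (a - p) / b."""
    return (a - desired_pole) / b

# Hypothetical estimated plant coefficients, as a self-tuning loop would
# re-supply them every sample:
a_hat, b_hat = 0.9, 0.5
K = pole_placement_gain(a_hat, b_hat, desired_pole=0.4)   # K = 1.0
y, r = 0.0, 1.0
for _ in range(20):
    y = a_hat * y + b_hat * K * (r - y)
print(round(y, 3))  # -> 0.833, the steady state b*K*r / (1 - a + b*K)
```

Note that pure proportional placement leaves a steady-state offset from the reference; the thesis's algebraic design, working from the estimated plant coefficients each sample, would also shape the tracking behavior, not just the pole location.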
Additional Counters for M-Sim In addition to the actual framework, small counters and software modules are added to the open-source M-SIM platform in order to collect time-series regression statistics. More specifically, a software module is developed that collects cache hit/miss counts and instruction counts over a sample period of 0.005 sec, or 500 cycles. As a result, the time-series statistics used in the parameter estimation algorithm are retrieved.
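The kind of counter module described above can be sketched as follows; only the 500-cycle sample window comes from the text, while the interface and numbers are assumed:

```python
class SampleCounters:
    """Sketch of a sampling counter module of the sort one could add to a
    simulator such as M-SIM: accumulate hit/miss and instruction counts
    and emit one time-series record per 500-cycle sample window.
    The interface is hypothetical, not M-SIM's actual API."""
    SAMPLE_CYCLES = 500

    def __init__(self):
        self.cycle = 0
        self.hits = self.misses = self.instructions = 0
        self.series = []        # one (hits, misses, instructions) per window

    def tick(self, hits=0, misses=0, retired=0):
        """Called once per simulated cycle with that cycle's event counts."""
        self.hits += hits
        self.misses += misses
        self.instructions += retired
        self.cycle += 1
        if self.cycle % self.SAMPLE_CYCLES == 0:
            self.series.append((self.hits, self.misses, self.instructions))
            self.hits = self.misses = self.instructions = 0

c = SampleCounters()
for i in range(1000):                     # 1000 cycles -> 2 sample windows
    c.tick(hits=1, misses=int(i % 100 == 0), retired=2)
print(c.series)  # -> [(500, 5, 1000), (500, 5, 1000)]
```

Each emitted tuple is one sample of the time series consumed by the parameter estimation algorithm.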
1.5 Thesis Outline
This thesis is divided into seven chapters. Following the introduction, Chapter 2 provides a
preliminary literature review on multiprocessor scheduling algorithms, multi-core multiproces-
sor architecture and adaptive self-tuning control frameworks. Chapter 3 provides an initial
mathematical model of thread execution patterns for the adaptive self-tuning control framework.
In Chapter 4, parameter estimation strategies for the cache access pattern are developed. Based on the mathematical models, a QR Recursive Least Squares (RLS) algorithm is implemented to capture the cache miss pattern of the thread. Chapter 5 investigates the algebraic controller design algorithms applicable to adaptive self-tuning control frameworks. In particular, the pole placement method is investigated and utilized to iteratively design the controller parameters. Chapter 6 concentrates on simulation of the cache-aware adaptive closed loop scheduling framework and interpretation of the simulation outcomes. Chapter 7 concludes the thesis and indicates future directions.
Chapter 2
Literature Review
In this chapter, the literature in the relevant research field is reviewed. In line with our topic,
adaptive cache-aware processor resource allocation, the literature in multi-core multiprocessor
system architecture, multiprocessor scheduling theory, and modern control theory has been
examined. Relevant findings have been summarized in the following sections.
2.1 Multi-Core Chip Level Multiprocessor System Architecture
A multi-core chip-level multiprocessor (CMP) system refers to a system composed of two or more cores integrated on a single chip. Although this on-chip architecture significantly increases overall computing capability, multi-core CMP systems still have shortcomings in optimally managing resources among cores. In other words, how on-chip resources are shared among cores is the actual challenge in these systems, and it affects overall system performance. As a consequence, processor designers endeavor to propose different mechanisms and design options to optimize resource management at the chip level. These new design approaches are implemented at three hardware levels: (1) core structure, (2) memory architecture, and (3) core diversity at the chip level.
2.1.1 Core Structure
As the atomic element of a multi-core system, a core is an independent execution unit whose performance and functionality shape the performance and functionality of the whole system. As an independent execution unit, each core has Level 1 (L1) data and instruction caches, an ALU, registers, and other hardware units. By definition, a core is a unit that reads and executes program instructions. For the smooth operation of a program, instructions should be processed rapidly; however, a simple core can only process one instruction at a time. To improve instruction throughput, core designers take advantage of parallelism.
Recent developments in parallelism have introduced new concepts such as the superscalar processor architecture. The superscalar architecture was introduced to improve instruction-level parallelism beyond that provided by traditional instruction pipelines and branch predictors. According to Olukotun [Olukotun, 2007], superscalar processors were developed to execute multiple instructions from a single instruction stream on each cycle. In other words, multiple instructions can be executed by the processor in a pipe stage. This is achieved by dynamically identifying, on each cycle, a set of instructions from the instruction stream capable of parallel execution, and executing them.
Despite the superscalar architecture's capability of issuing multiple instructions on each cycle, the instructions in the issue slots are limited to a single process: while a specific process is dedicated to a core, no instruction belonging to another process can be fetched into any of the execution slots. Hence, significant slot waste occurs when instructions stall while waiting for processor resources. In this context, multithreading was introduced to diminish the waste of execution slots in the superscalar pipeline architecture. Multithreading is thread-level parallelism at the core level. As discussed above, an n-way superscalar processor has n instruction issue slots per cycle, and each slot holds a single instruction to be executed in a pipe stage in a single cycle; thus, a maximum of n instructions can be executed in a pipe stage on each cycle. If the core supports only a single thread, then all execution slots belong to a single process for a dedicated time, and some execution slots in pipeline stages are wasted due to stalls caused by memory access latencies or branch prediction failures. As a result, there is both horizontal and vertical waste. Horizontal waste occurs when some of the n execution slots (n-way superscalar) are not used in a single execution cycle. Vertical waste refers to the case where an execution cycle goes completely unused, so all n slots (n-way superscalar) are wasted [Thimmannagari, 2008]. Thimmannagari defines four different multithreading techniques: vertical threading, simultaneous multithreading (SMT), branch threading, and power threading [Thimmannagari, 2008]. In vertical threading (VT), only instructions from one particular thread occupy a given pipe stage; in other words, multiple threads share the superscalar processing resources in aggregate, but not in the same execution cycle. Thus, this type of multithreading brings a complete solution to vertical slot waste. Simultaneous multithreading (SMT), by contrast, permits independent threads to issue instructions to the superscalar's functional units in a single cycle; hence, SMT addresses both vertical and horizontal waste of execution slots. Despite small differences in efficiency, both SMT and VT aim to minimize slot waste in a superscalar processing architecture. For instance, as shown in Figure 2.1, Thread 0 experiences a cache miss in the fifth time slot of the thread issue slots given in Figure 2.1b; in this case, Thread 1 takes over the execution slot. Hence, the memory stall does not lead to any execution time slot waste, and only stalls that particular thread, Thread 0. Branch threading and power threading, by contrast, address branch prediction and power-related issues, respectively. While branch threading switches threads on hitting a branch, to avoid the branch penalty in processors that do not support dynamic or static branch prediction, power threading switches threads based on the power dissipated by a particular thread, to keep the average power dissipated within specifications [Thimmannagari, 2008].
Figure 2.1: Instruction Issue and Cache Miss for a Single-Threaded Processor and a 2-Threaded Processor Supporting SMT [Thimmannagari, 2008]
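The two kinds of waste defined above can be counted directly from an issue-slot trace; the 4-way trace below is purely illustrative:

```python
def slot_waste(issue_matrix):
    """Classify wasted issue slots in an n-way superscalar trace.
    Each row is one cycle; 1 = slot filled, 0 = slot empty.
    A fully empty row is vertical waste; empty slots in a partially
    filled row are horizontal waste."""
    horizontal = vertical = 0
    for cycle in issue_matrix:
        empty = cycle.count(0)
        if empty == len(cycle):
            vertical += empty      # whole cycle unused
        else:
            horizontal += empty    # some slots unused
    return horizontal, vertical

# 4-way superscalar, 3 cycles: one full, one partial, one stalled cycle.
trace = [[1, 1, 1, 1],
         [1, 1, 0, 0],
         [0, 0, 0, 0]]
print(slot_waste(trace))  # -> (2, 4)
```

In these terms, VT eliminates only rows of zeros (vertical waste), whereas SMT can also fill the empty slots of partially used cycles (horizontal waste).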
In this context, our core structure uses instruction-level parallelism with an instruction pipeline on a superscalar core architecture; in addition, SMT is supported by our core architecture. For instance, if our core structure, as in the IBM POWER5 core, is an 8-way superscalar architecture, then instructions from eight threads can be issued per execution cycle [Vetter et al., 2006].
2.1.2 Memory Architecture
Memory architecture on a multi-core CMP platform refers to the on-chip memory architecture, the so-called on-chip cache architecture. Because emerging applications depend heavily on streaming and shared data sets, the on-chip cache architecture is significant for multi-core CMP system efficiency and utilization. That is, off-chip data communication between cores and main memory causes significant stalls, while a well-designed on-chip cache can minimize off-chip data communication substantially. As a result, research efforts in cache architecture have put forward innovative solutions such as multi-level cache structures and additional accessibility capabilities, as in shared caches. For instance, the Intel Itanium 2 series multi-core processors target maximum reliability and minimal cache errors for enterprise-level servers; they are developed with a three-level private cache memory architecture for each core to minimize cache instability and errors [Intel, 2006]. However, dedicated or private caches can rarely be utilized at their full capacity, which leads to a waste of available resources. Namely, a core's dedicated cache is underutilized if the threads running on the core do not issue enough cache memory requests, and over-utilized when the threads request more cache memory than the dedicated cache provides. Hence, most multi-core multiprocessor architectures, including the AMD Opteron, AMD Athlon, and Intel Pentium D, have one level of private cache, the Level 1 (L1) cache, and a shared Level 2 (L2) cache [Pfenning and Barbic, 2007]. Therefore, providing additional accessibility features, as in shared caches, at some level of the cache architecture is a popular strategy in cache architecture design.
In addition to the accessibility characteristics of a cache, the memory access characteristics of shared caches are another memory architecture design option. According to Kent and Williams, a uniform memory access (UMA) architecture refers to a shared memory structure in which all locations have the same access characteristics, including access times [Kent and Williams, 1997]; whereas in a Non-uniform Memory Access (NUMA) architecture, memory access characteristics, including access times, can depend on the location relative to the core [Tanenbaum, 2005]; for instance, a high-capacity cache can be deployed close to a core with high data streams. Our research is limited to uniform memory access shared cache architectures; NUMA shared cache architectures are generally deployed in heterogeneous core architectures. Further discussion of cache architecture is conducted in the cache architecture and policies sections.
2.1.3 Core Diversity and Parallelism
Core diversity refers to the homogeneity of the cores on a multi-core platform. In this regard, there are two different multi-core processing system architectures. The first is the homogeneous multi-core architecture, in which each core has the same specification and role; the second is the heterogeneous architecture, in which cores differ. Multiprocessing systems that treat each core equally are called symmetric multiprocessing (SMP) systems, and cores having the same specifications and features are called homogeneous cores. For the sake of simplicity, our scope is limited to symmetric multiprocessing systems and homogeneous core architectures.
As for parallelism in homogeneous cores, homogeneous multi-core processor platforms can support multiple levels of parallelization; depending on the application domain, each core may implement architectures such as superscalar (instruction-level parallelism), VLIW (instruction-level parallelism), SIMD (data-level parallelism), vector processing (data-level parallelism), or multithreading (task-level parallelism). In other words, homogeneous multiprocessor platforms can implement low-level instruction parallelism with superscalar and pipelining architectures, task (thread) level parallelism with multithreading or multitasking, and data-level parallelism with SIMD or vector processing architectures. For example, the IBM POWER5 dual-core processor is an 8-way superscalar architecture (8 threads per core) and supports simultaneous multithreading [Vetter et al., 2006]. The SUN T1 8-core processor is a 4-way pipeline architecture (4 threads per core) with fine-grained multithreading [Microsystems, 2006].
In our research, our homogeneous multi-core chip-level multiprocessor system takes advantage of instruction parallelism with instruction pipelines on an 8-way superscalar architecture, and thread-level parallelism with simultaneous multithreading (SMT) technology. As a result, only task (thread) level parallelism is considered in detail, and the cache-related thread scheduling challenges are addressed. In this respect, our research scope is limited to optimal thread scheduling at the multi-core chip processor level, which minimizes stalls due to cache misses. Although recent research on thread scheduling has addressed performance issues with thread switching methods, our research goal is to put forward an alternative approach that optimizes thread-level scheduling.
2.2 Cache Architecture and Policies
A computer's memory system has a hierarchical structure, from the highest-level memory unit (processor registers) to the lowest-level unit (offline storage). Each tier in this architecture has different characteristics; for instance, going down the memory hierarchy, one observes decreasing cost per bit, increasing capacity, and slower access times. Each memory system unit can be characterized by the following key characteristics [Stallings, 2003]:
1. Location: Processor, Internal(Main), External(Secondary)
2. Capacity: Word size, number of words
3. Unit of Transfer: Word, Block
4. Access Method: Sequential, Direct, Random, Associative
5. Performance: Access time, Cycle time, Transfer rate
6. Physical Type: Semiconductor, Magnetic, Optical, Magneto-optical
7. Physical Characteristics: Volatile/nonvolatile, Erasable/nonerasable
8. Organization
The main motivation for creating such a hierarchical structure is to boost memory access speed at an optimal architectural cost and efficiency. Hence, data is duplicated or partitioned among the different layers so that the processor's access time to data in the memory architecture is minimized. Throughout the discussion of issues within the memory hierarchy, three fundamental concepts play a critical role [Stacpoole and Jamil, 2000]:
Inclusion The inclusion property indicates that all information elements are originally stored in the outermost level (e.g. disk), Ln, of the memory hierarchy, and that subsets of the original data and instructions are moved up the hierarchy towards the processor registers, L0, during the execution of a process. Conversely, as data in L0 ages, it moves back down to the outermost level Ln. In addition, inclusion also covers the methods and units of data transfer between two levels of the hierarchy.
Coherence The coherence property implies that copies of the same information or data elements must remain consistent throughout the memory hierarchy.
Locality The locality property refers to the clustering of the behavioral characteristics of the data and memory demands of processes within certain regions of time and space. The memory hierarchy is designed around these behavioral characteristics of the CPU. There are two kinds of locality principles:
Spatial Locality Memory locations numerically close to a recently accessed location, in the same memory block, are likely to be accessed in the near future [Tanenbaum, 2005].
Temporal Locality If a particular memory location is accessed at one point in time, it is probable that the same location will be accessed again in the near future.
Based on these principles, a detailed discussion of cache architecture and its design elements is conducted in the following section.
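The two locality principles can be demonstrated with a toy direct-mapped cache simulation; the cache geometry and access patterns below are arbitrary assumptions chosen only to make the effect visible:

```python
def hit_rate(addresses, cache_lines=8, block_words=4):
    """Toy direct-mapped cache: sequential access (spatial locality) and
    repeated access (temporal locality) hit far more often than a
    scattered pattern. Sizes are illustrative assumptions."""
    cache = [None] * cache_lines          # block tag stored per line
    hits = 0
    for addr in addresses:
        block = addr // block_words       # which memory block
        line = block % cache_lines        # which cache line it maps to
        if cache[line] == block:
            hits += 1
        else:
            cache[line] = block           # miss: fetch block into the line
    return hits / len(addresses)

sequential = list(range(64))              # strong spatial locality
repeated   = [0, 1, 2, 3] * 16            # strong temporal locality
scattered  = [i * 37 for i in range(64)]  # poor locality
print(hit_rate(sequential), hit_rate(repeated), hit_rate(scattered))
# -> 0.75 0.984375 0.0
```

The hierarchy pays off precisely because real workloads look more like the first two patterns than the third.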
2.2.1 Cache Architecture
In such a hierarchy, cache memory is a high-speed semiconductor volatile memory used by a computer processor for temporary storage of information [Anthes, 2000]. Stallings [Stallings, 2003] defines the elements of a cache architecture as:
• Cache Size refers to the physical memory units allocated to the cache.
• Line Size refers to the addressable cache memory unit by the processor.
• Number of Caches: Single or two level/ Unified or Split indicates the number of levels
in the cache memory architecture of the processing platform.
• Mapping Function: Direct/Associative/Set-Associative refers to how actual physical memory blocks are mapped into cache lines. For instance, Figure 2.2 demonstrates a 4-way set-associative cache architecture. In this figure, the cache contains a copy of portions of main memory, the so-called memory blocks shown in Figure 2.2. The main memory is divided into K-word memory blocks. Set-associative mapping maps each memory block to a specific set of cache lines, within which the block may occupy any line; associative mapping maps an individual memory block to any cache line; and direct mapping is a static mapping between cache lines and memory blocks.
• Replacement Algorithm: Least Recently Used (LRU)/ Most Recently Used (MRU)/ First In First Out (FIFO)/ Least Frequently Used (LFU)/ Random/ Adaptive & Dynamic cache replacement algorithms signify the policy that determines which existing cache line is replaced when a new data block arrives from physical memory.
• Write Policy: Write Through/ Write Back/ Write Once is the policy that decides how updates to data in the cache memory are written back to physical memory.
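The mapping function can be made concrete by decomposing an address; the line size and set count below are hypothetical, chosen only for illustration:

```python
def decompose_address(addr, line_bytes=64, num_sets=128):
    """Split an address into (tag, set index, block offset) the way a
    set-associative cache does; a 64-byte line and 128 sets are assumed
    purely for illustration. In a 4-way set-associative cache, the block
    may then occupy any of the 4 lines (ways) of its set."""
    offset = addr % line_bytes        # byte within the cache line
    block = addr // line_bytes        # which memory block
    set_index = block % num_sets      # which set the block maps to
    tag = block // num_sets           # identifies the block within its set
    return tag, set_index, offset

print(decompose_address(0x12345))  # -> (9, 13, 5)
```

With direct mapping, `num_sets` equals the number of lines and each set has one way; with fully associative mapping there is a single set and the tag alone identifies the block.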
As mentioned previously, the main design objective of cache memory is to speed up memory access times based on the locality of data patterns. Hence, these elements in fact concern how physical data from memory are mapped into the cache and how fast the requested data can be served to the thread.
Figure 2.2: 4-Way Set Associative Cache and Main Memory
Based on these elements, different cache architectures can be constructed, and each unique cache design might perform well or poorly depending on the thread cache access pattern. As a result, it can be concluded that there is no single best cache design. That is the reason we consider cache-related issues inevitable and try to compensate for cache-related performance degradation using other resources.
2.2.2 Cache Performance Indicator: Cache Miss
In our project context, the cache architecture is characterized by a single metric that indicates the cache performance for a particular thread. In this regard, one of the most widely used cache performance indicators, the cache miss, is used. A cache miss refers to the situation in which a requested block is not found in the cache and has to be fetched from its original storage location or from lower-level caches. In fact, a cache miss is not only an indicator of the correctness of a design decision but also an indicator of the data access behavior of the applications running on a processor. Put differently, the cache miss measures the cache architecture's response to the applications' computational resource demands; hence, it is an unpredictable and dynamic indicator to be reasoned about for better system performance. In this regard, the applications' resource patterns as well as the cache architecture characteristics should be considered. According to its source, a cache miss can be categorized into three types [Stacpoole and Jamil, 2000]:
Compulsory Miss A compulsory cache miss occurs when the very first access to the requested block fails; in this case the requested block is retrieved into the cache from the main memory. This is also called a cold start or first reference miss. It is very hard to prevent this type of miss due to the unpredictability of application resource demands and data patterns; for such unpredictable patterns, the only solution might be speculative loading of data into the cache before the data is required. A larger cache block size can reduce compulsory misses, to an extent limited by the locality property of the applications' data access patterns.
Capacity Miss A capacity cache miss occurs when the processor requests a cache block which has already been discarded. A capacity miss usually happens throughout the execution of large programs or processes, since it is impossible for the cache to hold all the blocks used or needed by such a large process. In the prevention of capacity misses, the replacement policies as well as the cache size play an important role.
Conflict Miss Two memory blocks corresponding to the same cache line number may be accessed repeatedly, each one replacing the other in turn. This process is called thrashing. In the thrashing process, requesting a new block causes another block to be discarded [Stacpoole and Jamil, 2000]. A conflict miss occurs when the cache request for a new block fails because the corresponding cache line is occupied by another memory block. Conflict misses take place only in direct-mapped or set-associative caches. Conflict misses increase proportionally to the number of memory blocks mapped into the same cache line number, and a larger block size also increases conflict misses. Fully associative mapping prevents conflict misses, but at a significant implementation cost and with slower cache access times.
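The three miss classes can be distinguished concretely by replaying a block access trace against both the real cache and a fully associative LRU "shadow" cache of the same size: a non-first-reference miss that the shadow cache would have absorbed is a conflict miss, otherwise it is a capacity miss. The following sketch (a direct-mapped cache holding one block per line; the trace, cache size, and classification rule are illustrative assumptions) applies this idea:

```python
from collections import OrderedDict

def classify_misses(block_trace, num_lines=4):
    """Classify misses of a direct-mapped cache with `num_lines` lines
    (one block per line) into compulsory, conflict and capacity misses.

    A miss is compulsory on the first reference to a block; otherwise it
    is a conflict miss if a fully associative LRU cache of equal size
    would have hit, and a capacity miss if even full associativity
    would not have helped.  (Sketch; real classifiers vary in detail.)
    """
    direct = {}                      # line index -> resident block
    full = OrderedDict()             # fully associative LRU shadow cache
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "conflict": 0, "capacity": 0}
    for block in block_trace:
        line = block % num_lines
        hit_direct = direct.get(line) == block
        hit_full = block in full
        # keep the LRU shadow cache up to date
        if hit_full:
            full.move_to_end(block)
        else:
            if len(full) == num_lines:
                full.popitem(last=False)   # evict least recently used
            full[block] = True
        if hit_direct:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif hit_full:
            counts["conflict"] += 1
        else:
            counts["capacity"] += 1
        seen.add(block)
        direct[line] = block
    return counts

# Blocks 0 and 4 thrash the same line of a 4-line direct-mapped cache:
# 2 compulsory misses, then 4 conflict misses, no capacity misses.
print(classify_misses([0, 4, 0, 4, 0, 4]))
```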
As suggested, the impact of cache design elements on cache miss reduction is limited by the data locality characteristics of the applications and processes. For instance, compulsory misses are unpredictable at the cache level before the actual request takes place, so those misses can only be addressed by dynamic and stochastic cache management strategies, as well as application data pattern prediction algorithms. In our thesis, cache misses are used as a metric involved in the estimation of cache access patterns and of the stalls they cause in overall processor performance.
2.3 Multiprocessor Scheduling
Scheduling is the act of assigning resources to activities or tasks. Scheduling problems include a set of constraints, such as deadlines, to be met by any schedule. For any particular set of constraints, two problems should be addressed. The first is the decision problem, which determines whether a given instance is feasible (schedulable) or not; the second is the scheduling problem, which builds the schedule for a given feasible instance. Hence, any scheduling strategy considering a particular scheduling problem can be partitioned into two sub-problems: the decision problem and the scheduling problem [Baruah et al., 1996]. While the decision problem is used in complexity analysis, the scheduling problem derives the actual schedule for a given instance; hence, the optimality of the solution depends entirely on the scheduling problem. At this point, the actual difference between multiprocessor and uniprocessor systems lies in the fact that feasibility does not guarantee optimality of the solution on a multiprocessor system, while it does on a uniprocessor system.
Despite the fact that emerging multiprocessing systems are an important innovation for the computational capabilities of computing systems, multiprocessor systems also bring an important challenge in extending existing uniprocessor real-time scheduling algorithms to the multiprocessing environment. In our research, our main interest is concentrated on the multi-core chip level multiprocessor (CMP). One of the main characteristics of such systems is the existence of shared memory and a common base of time [Cottet et al., 2002]. In this way, each processor (core) has a global view of every other processor (core) at any instant of time. In such a processing environment, a scheduling algorithm is valid if and only if all task deadlines are met. As a result, despite strong analogies between strongly coupled multiprocessing systems and centralized (uniprocessor) systems, Cottet et al. [Cottet et al., 2002] underline the major incapability of the traditional centralized approach. According to Cottet et al. [Cottet et al., 2002], traditional real-time scheduling cannot be optimal in such multiprocessing environments. As a proof, they illustrate the failure of EDF to satisfy optimality in a multiprocessing environment. Hence, the traditional real-time scheduling approach on the uniprocessor platform should be adapted to the multiprocessing environment.
Throughout this section, a multiprocessor scheduling taxonomy is introduced with the relevant discussion of our multi-core chip multiprocessing (CMP) scheduling problem. Then, traditional real-time multi-core CMP scheduling is discussed. Before the discussion of cache-aware scheduling, resource estimation and access pattern analysis are addressed in detail. This is followed by the discussion of cache-aware scheduling, covering cache-fair thread scheduling and then pattern-aware cache scheduling (replacement) in detail.
2.3.1 Multiprocessor Scheduling Taxonomy
The multiprocessor scheduling problem refers to the problem of pre-emptively scheduling a real-time task set on a symmetric multiprocessor (SMP) consisting of a number of cores. According to Bertogna et al. [Bertogna et al., 2009], this problem can be tackled in two different ways: (1) by partitioning tasks to processors, or (2) by a global scheduler. The partitioning method, analogous to the bin packing problem, is in fact NP-hard. However, for a fixed task set with known priorities, the problem can be reduced to a number of NP-easy problems after the initial partitioning of the tasks. In this case, this method provides a simple and efficient approach. Otherwise, for an unpredictable or unknown processing environment, either online partitioning algorithms or additional load balancing algorithms should be used.
As an alternative to the partitioning approach, global scheduling has a single system-wide queue of ready tasks; tasks in this queue are scheduled on available computing resources [Bertogna et al., 2009]. In contrast to the partitioning approach, different instances of a task can be allocated to different cores, and dynamic runtime workload changes can be addressed, since the schedulability metric, the actual workload of the system, can be retrieved at each instant; so, any instantaneous workload change will be immediately reflected in the scheduling decision. In addition, if task migration is allowed, real-time load balancing can also be performed over the whole system. In that sense, global scheduling is more appropriate for systems with time-varying, unpredictable workloads [Bertogna et al., 2009]. Considering this fact, our scheduling framework will be a global scheduling algorithm rather than a partitioning algorithm.
In the multi-core chip level multiprocessor (CMP) system, a shared cache as well as dedicated caches per core are deployed on the chip; in addition, each core is capable of handling multiple threads simultaneously, so-called multithreading. Hence, in our scheduling strategy, it is possible to determine dynamic priorities at the thread level. A value-based scheduling policy, which considers the scheduling problem as an optimization problem with additional constraints and parameters such as cache allocations and multithreading, is used to determine the dynamic priority of each task. In other words, the set of coefficients which gives an optimal solution to our scheduling optimization problem indicates the dynamic priorities of the tasks.
Although migrating threads among cores is an option, the necessity of migration on a platform where resource sharing is implemented is a main concern. That is, migrating a thread to another core requires a thread context switch, which moves the thread stack, including pointers and the L1 instruction and data cache states, to the new core. All these operations cause a delay due to the transfer of information and state among cores. Despite these processing overheads and stalls, thread migration is critical in some multi-core architectures, such as the NUMA (Non-Uniform Memory Access) architecture, in which cores have faster access to their local cache than to shared memory, and in heterogeneous multi-core systems, where each core is designed for a specific purpose with specifications such as cache memory capacity. In such environments, the initial thread placement will be based on the data locality of the thread. This can cause specific cores to be overloaded due to poor initial placement, and in some cases quick load variations can lead to poor utilization of a core [Itzkovitz et al., 1998]. In these conditions, thread migration makes it possible to continue the execution of such threads.
In our cache aware adaptive closed loop scheduling framework, thread migration is ignored because the stability of each core is very critical to our framework. That is, our closed loop framework only addresses the steady state dynamics; thread migration would cause instability in our closed loop framework, and this is not desirable for our system.
2.3.2 Real-Time Multi-Core Multiprocessor Scheduling Algorithms
The real-time scheduling algorithms designed for the uniprocessor platform fail to provide optimality on multiprocessing platforms. In other words, a scheduling algorithm allocating 'm' tasks onto 'n' processors or cores may not realize an optimal allocation even if it achieves a feasible (schedulable) one.
That is, although most traditional scheduling algorithms such as Earliest Deadline First (EDF) and Least Laxity First (LLF) remain feasible on multiprocessor systems [Cottet et al., 2002], the actual problem is their failure to provide an optimal schedule rather than a feasible one. As a result, all these algorithms are inadequate for multiprocessing platforms.
In this regard, the Pfair class of global scheduling algorithms is one of the major approaches that provides an optimal solution to multiprocessor scheduling problems. The Pfair algorithm, proposed by Baruah et al., manages to optimally solve, in polynomial time, the multiprocessor scheduling problem, which was previously believed to be NP-hard [Anderson and Srinivasan, 1999]. The scheduling strategy of Baruah et al. considers proportionate progress: each task is scheduled resources in proportion to its weight. This is called proportionate fairness [Baruah et al., 1996].
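Proportionate fairness is commonly formalized through the notion of lag: with task weight w_T (its execution requirement over its period), a schedule is Pfair iff lag(T, t) = w_T × t − allocated(T, t) stays strictly between −1 and 1 for every task and time [Baruah et al., 1996]. A minimal checker for this condition over a slot-based schedule might look like the following (the slot/set schedule encoding is an assumed representation, not Baruah et al.'s notation):

```python
def is_pfair(weights, schedule):
    """Check the proportionate-fairness condition for a slot-based schedule.

    `weights[i]` is task i's weight e_i/p_i; `schedule[t]` is the set of
    task indices that receive a processor in slot t.  The schedule is
    Pfair iff, for every task i and time t, the lag
        lag(i, t) = weights[i] * t - allocated(i, t)
    stays strictly between -1 and 1.
    """
    allocated = [0] * len(weights)
    for t, running in enumerate(schedule, start=1):
        for i in running:
            allocated[i] += 1
        for i, w in enumerate(weights):
            lag = w * t - allocated[i]
            if not -1 < lag < 1:   # lag drifted a full quantum: not Pfair
                return False
    return True

# Two tasks of weight 1/2 alternating on one processor are Pfair;
# running one of them for two consecutive slots first is not.
print(is_pfair([0.5, 0.5], [{0}, {1}, {0}, {1}]))   # True
print(is_pfair([0.5, 0.5], [{0}, {0}, {1}, {1}]))   # False
```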
To reduce the worst-case computational complexity of the PF algorithm, Baruah proposes another algorithm, PD scheduling, with a complexity of O(min(m log n, n)), which replaces the tie-break procedure of the PF algorithm with a more efficient one. According to the new tie-break procedure, two new categories of tasks are defined: heavy and light tasks. In addition, the PD algorithm associates a pseudo-deadline with each allocation of every light task and each non-allocation of every heavy task. Moreover, the state of the PD algorithm is characterized by a tableau keeping count of the number of associated pseudo-deadlines satisfied up to the current point for each future slot and weight class [Baruah et al., 1995]. Despite the computational and algorithmic efficiency of the PD algorithm, its complexity encouraged other researchers to develop simpler algorithms.
Anderson and Srinivasan [Anderson and Srinivasan, 1999] manage to simplify the PD algorithm further, to a version which includes one intermediate deadline and two tie-break parameters, and they prove that these two tie-break parameters are necessary for a feasible task allocation. Moreover, Anderson and Srinivasan reveal that consecutive window overlaps of subtasks do not occur only at job boundaries. In other words, the PD and PF algorithms use the scheduling decision given at each release of a task T, the so-called job of T, as the priority for all subtasks of that job, i.e. throughout the task period. Hence, there is no change in PF priorities of tasks during the task interval; however, this might create inefficiency. Considering this possibility, Anderson and Srinivasan [Anderson and Srinivasan, 1999] introduce the swapping argument, which discusses systematically interchanging pairs of subtasks, on the basis of the priority definition, during the task period. The feasibility and correctness of their swapping argument and of their simplified priority scheme are also proved in their article in detail; simulation of their simplified and revised algorithm is performed on a dual core environment. In my opinion, the most innovative element in their work is the systematic interchanging of subtask pairs throughout the task period.
The second paper of Anderson and Srinivasan [Anderson and Srinivasan, 2000] addresses the work-conserving problem of the PD algorithm. In Pfair scheduling, each subtask of a task T must be scheduled within a "window" of time slots, and the last slot of this window represents the deadline of that subtask. However, if some subtask of T is completed "early" within its window, then T is ineligible for execution until the beginning of the window of its next subtask [Anderson and Srinivasan, 2000]. This is against the work-conserving principle: a scheduler is work-conserving if and only if it never leaves a processor idle while there exists an uncompleted job in the system that could be executed on that processor. Anderson and Srinivasan [Anderson and Srinivasan, 2000] introduce the ERfair scheduling algorithm, which alters the Pfair algorithm in such a way that if two subtasks are part of the same job, then the second subtask becomes eligible for execution as soon as the first completes.
In this way, the early release version of PD will have several advantages over PD. First of all,
average job response time is improved, especially in lightly loaded systems. Moreover, the runtime cost is decreased, since the bookkeeping information required by PD regarding task eligibility is not needed by the Early Release PD algorithm (ER-PD) [Anderson and Srinivasan, 2000]. In summary, Anderson and Srinivasan [Anderson and Srinivasan, 2000] contribute to multiprocessor scheduling the work-conserving scheduling algorithm ER-PD, which can optimally schedule periodic tasks on a multiprocessor system in polynomial time.
Despite the efficiency and optimality of the real-time multiprocessor scheduling algorithms in allocating tasks to multiple cores or processors, the main goal of these algorithms is to optimally share processor time-related resources. In other words, these algorithms are time-centric scheduling algorithms, and their capabilities are limited to time-related resources. They fail to consider factors such as cache memory resources, which have a significant impact on scheduling efficiency. To address this, cache-aware scheduling algorithms, which add cache awareness to the existing time-centric real-time multiprocessor scheduling algorithms, have been developed.
2.4 Cache-Aware Multi-Core Chip Level Multiprocessor Scheduling
Cache-aware multi-core chip level multiprocessor scheduling addresses the incapability of real-time multi-core multiprocessor scheduling algorithms to analyze memory-related issues, such as thread access patterns and memory stalls. Cache-aware scheduling can be classified into three groups: cache-fair thread scheduling, cache pattern-aware thread scheduling (placement) and adaptive cache-aware thread scheduling. Due to their relevance to our project, only cache-fair and adaptive cache-aware thread scheduling are covered in this section.
2.4.1 Cache-Fair Multi-Core CMP Scheduling
The cache-fair thread scheduling algorithms take advantage of the fact that fair sharing of resources among threads minimizes resource conflicts and bottlenecks. In the context of resource allocation, Jain et al. [Jain et al., 1998] discriminate between unfair and fair algorithms, and explain fairness in terms of quantitative formulas and metrics. Jain et al. [Jain et al., 1998] define fairness via a two-step approach. The first step is to select the appropriate metric depending on the application domain. The second step is to define the fairness index, which measures the equality of resource allocation. In fact, fairness does not necessarily correspond to equal distribution; it is completely dependent on the definition of the fairness index. In addition, Jain et al. [Jain et al., 1998] underscore the fairness index properties of scale and metric independence, size independence, boundlessness, and continuity. In this context, size independence refers to the fact that the fairness index is independent of the number of elements sharing resources; boundlessness refers to the fact that the fairness index should not be limited in value; and scale and metric independence denotes independence of metric values or scales. Following this guidance, the cache-fair scheduling algorithms deal with fairness in cache allocation.
Fedorova et al. [Fedorova et al., 2006] apply the fairness concept to the cache allocation problem on the multi-core platform to reduce the effect of unequal multi-core processor cache allocation. The architecture of multi-core processors enables multiple application threads to run concurrently on a single processor chip. Multi-core multiprocessors generally use a shared Level 2 (L2) cache memory, shared by concurrently running threads, or co-runner(s). In such an architecture, cache sharing depends only on the cache demands of the co-runner(s), so it is inevitable that unfair cache allocation takes place. A thread's cache allocation affects its cache miss rate, which has a direct impact on its cycles-per-instruction (CPI) rate, the rate at which the thread retires instructions. Hence, a thread's CPU performance is considerably dependent on its co-runners. Fedorova et al. [Fedorova et al., 2006] summarize the problems due to this co-runner-dependent performance variability as:
Unfair CPU Sharing A thread's forward progress is dependent not only on its time slice share of the CPU, but also on the cache behavior of its co-runner(s).
Poor Priority Enforcement If a high priority job is scheduled with co-runner(s) with high cache occupancy, it will experience poor performance rather than high performance.
Inadequate CPU accounting On grid-like systems where users are charged for CPU hours, the amount of computation performed per hour varies depending on the co-runners; so, such charging is not appropriate.
In order to tackle these problems, Fedorova et al. [Fedorova et al., 2006] propose to provide fairness in thread CPU latency. In other words, no matter what the cache allocation is, a thread's CPU latency will be the same as its CPU latency under the cache-fair condition. Namely, if a thread's performance degrades due to unfair cache sharing, the algorithm redistributes CPU time and gives more time to that particular thread. In fact, the additional CPU time given to the thread is taken away from other threads. Therefore, Fedorova et al. [Fedorova et al., 2006] define two classes of threads in the system: the cache-fair class and the best-effort class.
Indeed, the approach of Fedorova et al. [Fedorova et al., 2006] reduces the co-runner-dependent performance variability only for threads in the cache-fair class. Figure 2.3 illustrates CPU latency versus cache allocation for three different scenarios:
1. In the first scenario, the cache is shared unequally among threads, and unfortunately Thread A gets less cache space than the other threads. In this case, if a conventional scheduler is used to schedule the threads, then it is inevitable that Thread A's CPU latency is higher than in the equal-sharing scenario.
2. The second scenario refers to the ideal case where the cache is shared equally. In this case, Thread A's CPU latency is considered the ideal CPU latency.
3. In the third scenario, similar to the first, the cache is not shared equally and Thread A gets less cache space. However, in contrast to the first scenario, the cache-fair scheduler SHARP is used. In this case, Thread A is allocated extra CPU time, taken away from Thread B, in order to ensure that Thread A has the same CPU latency as in the second scenario, where cache allocation is fair among threads.
In the article, the fairness index is in fact the deviation of the cache miss rate from the cache-fair miss rate. The derivation of the cache-fair miss rate of thread T is based on two assumptions: 1) cache-friendly co-runners have similar cache miss rates, and 2) the relationship between co-runners' miss rates is linear [Fedorova et al., 2006]. If thread T runs with several different co-runners, then, based on the relationship between T and its co-runners, the miss rate that T would experience with cache-friendly co-runners is estimated, and this miss rate is taken as the cache-fair miss rate.
Figure 2.3: Thread CPU Latency vs Cache Allocation for Different Scenarios [Fedorova et al., 2006]
Fedorova et al. formulate this method as below:
Assuming the relationship between co-runners' miss rates is linear:
MissRate(T) = a × ∑_{i=1}^{n} MissRate(C_i) + b,    (2.1)
where T is the thread for which the fair miss rate is computed, C_i is the i-th co-runner, n is the number of co-runners, and a and b are the linear equation coefficients. Since thread T experiences its fair cache miss rate FairMissRate(T) when all concurrent threads experience the same miss rate:
FairMissRate(T) = MissRate(T) = MissRate(C_i),    (2.2)
Equation (2.1) can then be expressed as
FairMissRate(T) = a × n × FairMissRate(T) + b,
and solving for the fair miss rate gives
FairMissRate(T) = b / (1 − a × n).    (2.3)
Hence, deriving the unknowns a and b in Equation (2.1) from the known miss rates of thread T and its co-runner(s) C_i is sufficient to calculate the fair miss rate of thread T using Equation (2.3).
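In practice, Equations (2.1) and (2.3) amount to fitting a line through measured miss-rate samples. A minimal sketch of that computation (the sample format, the least-squares fit, and the single-co-runner example values are illustrative assumptions, not Fedorova et al.'s exact procedure) is:

```python
def fair_miss_rate(observations, n):
    """Estimate a thread's fair miss rate via Equations (2.1) and (2.3).

    `observations` holds (x, y) samples, where x is the sum of the n
    co-runners' miss rates and y is the thread's measured miss rate,
    gathered over runs with different co-runner sets.  A least-squares
    line y = a*x + b is fitted; Equation (2.3) then gives the fair
    miss rate b / (1 - a*n).
    """
    xs = [x for x, _ in observations]
    ys = [y for _, y in observations]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    a = (sum((x - mx) * (y - my) for x, y in observations)
         / sum((x - mx) ** 2 for x in xs))   # slope of Equation (2.1)
    b = my - a * mx                          # intercept
    return b / (1 - a * n)                   # Equation (2.3)

# Samples consistent with a = 0.5, b = 0.01 and one co-runner:
samples = [(0.02, 0.02), (0.04, 0.03), (0.06, 0.04)]
print(fair_miss_rate(samples, n=1))   # ~0.02
```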
After successfully estimating the fair miss rate of thread T, the next step is to derive the fairness metric, the thread's CPU latency. Since a thread's CPU latency is the product of its cycles per instruction (CPI) and its share of CPU time, it is possible to define the fairness metric as the CPI [Fedorova et al., 2006]. Using the CPI metric, the cache-fair scheduling algorithm proposed by Fedorova equates the fairness metric, CPI, to its fair (ideal) value, FairCPI, by adjusting the thread's share of CPU time. Fedorova et al. [Fedorova et al., 2006] provide the following analytical model to perform this operation:
CPI = CPIIdeal + L2CacheStalls, (2.4)
where CPIIdeal is the CPI when there are no L2 cache misses, L2CacheStalls is the per-
instruction stall time due to handling L2 cache misses.
The fair CPI is:
FairCPI = CPIIdeal + FairL2CacheStalls, (2.5)
where FairCPI is the CPI when the thread experiences its fair cache miss rate. In order to estimate FairCPI, the CPIIdeal and FairL2CacheStalls terms must be determined. To derive CPIIdeal, Equation (2.4) is used: the actual CPI and the statistics required to derive the thread's actual L2CacheStalls are measured using hardware counters, and then a simple subtraction gives CPIIdeal.
To compute FairL2CacheStalls, L2CacheStalls is first represented as a function of the cache miss rate, the memory stall time MemoryStalls, and the store buffer stall time StoreBufferStalls [Matick et al., 2001]:
L2CacheStalls = F (MissRate,MemoryStalls, StoreBufferStalls). (2.6)
FairL2CacheStalls is also derived using the same function, with the same MemoryStalls and StoreBufferStalls parameters, but with the FairMissRate estimated by Equation (2.3).
FairL2CacheStalls = F (FairMissRate,MemoryStalls, StoreBufferStalls). (2.7)
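Putting Equations (2.4)-(2.7) together, FairCPI can be sketched as below. The stall function F of Equation (2.6) is not reproduced in closed form in the text, so the linear per-miss stall model used here is purely an illustrative assumption:

```python
def fair_cpi(actual_cpi, miss_rate, fair_miss_rate,
             memory_stalls, store_buffer_stalls):
    """Estimate FairCPI from hardware-counter measurements.

    stall_model plays the role of F in Equations (2.6)/(2.7); here it is
    a hypothetical linear model: per-instruction L2 stall time equals
    the miss rate times the average stall cycles per miss.
    """
    def stall_model(rate):                            # assumed form of F
        return rate * (memory_stalls + store_buffer_stalls)

    l2_stalls = stall_model(miss_rate)                # actual L2CacheStalls
    cpi_ideal = actual_cpi - l2_stalls                # rearranged Eq. (2.4)
    return cpi_ideal + stall_model(fair_miss_rate)    # Eq. (2.5) with (2.7)

# A thread missing at twice its fair rate: its fair CPI is lower.
print(fair_cpi(actual_cpi=2.0, miss_rate=0.25, fair_miss_rate=0.125,
               memory_stalls=3, store_buffer_stalls=1))   # 1.5
```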
After deriving the fairness metric and index, Fedorova et al. [Fedorova et al., 2006] construct the cache-fair algorithm as a two-phase algorithm, and each cache-fair thread goes through both phases. The first phase, the so-called reconnaissance phase, is the initial phase in which the fairness metric (the fair cache miss rate) and the inputs for the fairness index (the fair CPI model) are generated. Namely, the inputs for the fair CPI model are (1) the actual CPI, (2) the actual L2 cache stalls (L2CacheStalls), and (3) the fair L2 cache stalls (FairL2CacheStalls). These inputs are obtained respectively as follows:
1. The actual CPI is computed as the ratio of the number of CPU cycles T has executed to the number of instructions T has retired.
2. The actual L2 cache stall time L2CacheStalls is estimated using the model given in Equation (2.6); the inputs, namely the cache miss rate and the memory and store buffer stall times, are retrieved from hardware counters.
3. FairL2CacheStalls is estimated using the model defined in Equation (2.7). The inputs are retrieved using hardware counters.
Once the reconnaissance phase is completed, thread T moves into the calibration phase,
where the scheduler periodically redistributes CPU time. A single calibration of thread T adjusts T's CPU quantum based on the fairness metric, the difference between the actual and fair CPI. If the calibration requires additional CPU resources, the scheduler picks a best-effort thread, some of whose CPU time is transferred to thread T. Regarding phase duration and repetition, the initial reconnaissance phase is 100 million instructions long; the thread then moves into the calibration phase, where its CPU quantum is adjusted every ten million instructions. The more frequent the adjustment, the more responsive the algorithm. Subsequent reconnaissance phases are repeated every one billion instructions; this longer period reflects the observation that L2 cache access patterns change gradually and infrequently [Fedorova et al., 2006].
Fedorova et al. [Fedorova et al., 2006] also set up test cases with different workloads and
verified the significant drop in co-runner dependent performance variability for the cache-fair
threads. The paper of Fedorova et al. has a major impact on our cache aware adaptive closed
loop scheduling framework.
Besides the contribution of Fedorova et al., Kim et al. [Kim et al., 2004] present a significant study in fair cache sharing and partitioning. This paper makes several contributions. First of all, it proposes five cache metrics that measure the degree of fairness in cache sharing, evaluates these metrics, and shows the correlations between them. Kim et al. [Kim et al., 2004] define ideal fairness as execution time fairness, which is achieved only when all co-scheduled threads have an equal slowdown Tshr/Tded:
Tshr_1/Tded_1 = Tshr_2/Tded_2 = Tshr_3/Tded_3 = · · · = Tshr_n/Tded_n,    (2.8)
which is referred to as the execution time fairness criterion. The criterion can be met if, for any pair of co-scheduled threads i and j, the following metric (M_0^{ij}) is minimized:
M_0^{ij} = |X_i − X_j|,    (2.9)
where X_i = Tshr_i/Tded_i. However, it is very difficult to measure Tshr_i during execution, since there is no reference point at which the dedicated-cache and shared-cache execution times can be compared. Hence, metrics that are easier to measure are defined, and
the ones that correlate highly with M_0^{ij} are used. Kim et al. [Kim et al., 2004] define five metrics for any co-scheduled threads i and j:
M_1^{ij} = |X_i − X_j| where X_i = Miss_shri / Miss_dedi,    (2.10)
M_2^{ij} = |X_i − X_j| where X_i = Miss_shri,    (2.11)
M_3^{ij} = |X_i − X_j| where X_i = Missr_shri / Missr_dedi,    (2.12)
M_4^{ij} = |X_i − X_j| where X_i = Missr_shri,    (2.13)
M_5^{ij} = |X_i − X_j| where X_i = Missr_shri − Missr_dedi.    (2.14)
By minimizing any of these metrics, a fair cache algorithm aims to improve fairness. When there are more than two co-scheduled threads, the fairness metric is summed over all possible pairs, M_x = ∑_i ∑_j M_x^{ij}, and this sum is minimized to improve fairness. According to Kim et al. [Kim et al., 2004], each metric has its own way of providing fairness; for instance, metric M1 tries to equalize the ratio of miss increase of each thread, while M2 tries to equalize the number of misses. Likewise, M3 seeks to equalize the ratio of miss rate increase of each thread, while M4 seeks to equalize miss rates, and M5 tries to equalize the increase in the miss rates of each thread [Kim et al., 2004]. According to how well these metrics correlate with the execution time fairness metric M0 (Equation (2.9)), one of these metrics substitutes for M0 to guide a cache fairness policy.
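The five pairwise metrics of Equations (2.10)-(2.14) are straightforward to compute once the shared and dedicated miss counts and miss rates of a thread pair are known; a direct transcription (the dict-based input format is an assumed representation) is:

```python
def fairness_metrics(thread_i, thread_j):
    """Compute Kim et al.'s pairwise fairness metrics M1..M5.

    Each thread is a dict of shared/dedicated miss counts and miss rates:
    {"miss_shr": ..., "miss_ded": ..., "missr_shr": ..., "missr_ded": ...}.
    Each metric is |X_i - X_j| for a different choice of X.
    """
    i, j = thread_i, thread_j
    return {
        # ratio of miss increase under sharing
        "M1": abs(i["miss_shr"] / i["miss_ded"] - j["miss_shr"] / j["miss_ded"]),
        # raw number of misses under sharing
        "M2": abs(i["miss_shr"] - j["miss_shr"]),
        # ratio of miss-rate increase under sharing
        "M3": abs(i["missr_shr"] / i["missr_ded"] - j["missr_shr"] / j["missr_ded"]),
        # miss rate under sharing
        "M4": abs(i["missr_shr"] - j["missr_shr"]),
        # absolute increase in miss rate under sharing
        "M5": abs((i["missr_shr"] - i["missr_ded"]) - (j["missr_shr"] - j["missr_ded"])),
    }

# Both threads double their misses under sharing, so M1 and M3 are zero:
print(fairness_metrics(
    {"miss_shr": 200, "miss_ded": 100, "missr_shr": 0.02, "missr_ded": 0.01},
    {"miss_shr": 400, "miss_ded": 200, "missr_shr": 0.04, "missr_ded": 0.02}))
```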
Secondly, using these metrics, static and dynamic L2 cache partitioning algorithms optimizing
fairness are proposed. In the static L2 cache partitioning algorithm, the stack distance profile is
used to predict the parameters of the selected metric Mx; then, among all possible partitions, the
one that minimizes Mx is applied during the co-schedule. However, stack distance profiling can
only be used with the LRU cache replacement policy, and it is obtained statically by the compiler
or by simulation; hence, the static L2 cache partitioning algorithm relies on the caches utilizing
the LRU replacement algorithm. In contrast, the dynamic L2 cache partitioning algorithm is
more elaborate, consisting of three parts: initialization, rollback and repartitioning. In the
initialization step, the cache is partitioned equally among the cores. Following this step, the
rollback step reverses any repartitioning decision that is not beneficial, by comparing miss rate
statistics before and after the decision. In the repartitioning step, the fairness is computed for
each thread, and small numbers of cache blocks, determined by the partition granularity, are
reallocated until cache fairness is achieved. These three steps are repeated periodically, but the
period length is kept long enough to observe the full effect of cache repartitioning.
2.4. CACHE-AWARE MULTI-CORE CHIP LEVEL MULTIPROCESSOR SCHEDULING 35
Regarding algorithm overhead, the dynamic L2 cache algorithm incurs profiling and storage
overhead in the form of a number of registers per thread to measure miss counts; Kim et al.
claim that this overhead is less than 0.01% compared to the smallest time unit.
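The three steps of the dynamic algorithm can be sketched as follows (a minimal sketch under our own simplifications: slowdowns and miss rate statistics are passed in as plain lists, and the repartitioning move is reduced to shifting one granularity's worth of blocks from the least-slowed to the most-slowed thread):

```python
def init_partition(total_blocks, n_threads):
    """Initialization step: divide the cache equally among the cores
    (any remainder blocks are left unassigned in this sketch)."""
    return [total_blocks // n_threads] * n_threads

def repartition(partition, slowdowns, granularity=1):
    """Repartitioning step: move one granularity's worth of blocks from
    the least-slowed thread to the most-slowed one."""
    new = list(partition)
    worst = max(range(len(slowdowns)), key=lambda i: slowdowns[i])
    best = min(range(len(slowdowns)), key=lambda i: slowdowns[i])
    if worst != best and new[best] > granularity:
        new[best] -= granularity
        new[worst] += granularity
    return new

def rollback_if_worse(prev_partition, new_partition,
                      prev_miss_rate, new_miss_rate):
    """Rollback step: undo a repartitioning decision that was not
    beneficial, judged by miss rate statistics before and after."""
    return prev_partition if new_miss_rate > prev_miss_rate else new_partition
```

In the real algorithm these steps repeat once per (long) period, so that each repartitioning decision can be judged on its full effect.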
Kim et al. [Kim et al., 2004] perform a simulation to evaluate the performance efficiency and
operability of their algorithm. According to the simulation results, metrics M1, M3, M4 and M5
produce good correlation with the execution time fairness metric M0. Regarding the throughput
and fairness relation, Kim et al. [Kim et al., 2004] conclude that optimizing fairness usually
increases throughput, while maximizing throughput does not necessarily improve fairness. This
article provides us with a deeper insight into cache-fairness metrics, stack-distance profiling,
how partitioning algorithms can be used to enforce cache-fairness, and the correlation between
cache-fairness and throughput efficiency.
Zhou et al. [Zhou et al., 2009b] address the weakness of the cache-aware algorithms in
ensuring performance fairness. This article introduces a model to examine the performance
impact of cache sharing, and proposes a mechanism of cache-sharing management to provide
performance fairness for concurrently executing applications. According to Zhou et al. [Zhou
et al., 2009b], two factors affect the correlation between cache miss fairness and performance
fairness:
1. The performance sensitivity of an application to cache misses depends on the fraction
of execution time stalled on cache misses. This fraction varies across applications; for
instance, data-centric applications spend many more cycles on memory access than
computation-oriented applications.
2. The stall caused by each miss also differs according to Memory Level Parallelism (MLP)
and Instruction Level Parallelism (ILP). For instance, the latency cycles of a miss may
overlap with those of recent misses: a data request that misses in a higher level cache is
issued directly to the lower level shared L2 cache, which may supply the data from previous
requests, reducing the cache miss penalty. Moreover, computation cycles may overlap with
memory access latency cycles; in other words, while the job is waiting for a computational
operation to complete, data can be supplied from the memory. Hence, the actual penalty of
a cache miss is reduced.
36 CHAPTER 2. LITERATURE REVIEW
Considering all these interrelations between cache misses and performance fairness, Zhou
et al. [Zhou et al., 2009b] define two fairness metrics: a performance fairness metric Mperf over
pairs of concurrently running workloads i, j under a given cache partition, and a cache miss
fairness metric Mmiss. Performance fairness and cache fairness are achieved when Mperf and
Mmiss equal zero, respectively.
M_{perf} = \sum_{i}\sum_{j}\left|\frac{CPI^{shr}_i}{CPI^{ded}_i} - \frac{CPI^{shr}_j}{CPI^{ded}_j}\right|, (2.15)

M_{miss} = \sum_{i}\sum_{j}\left|\frac{MPKI^{shr}_i}{MPKI^{ded}_i} - \frac{MPKI^{shr}_j}{MPKI^{ded}_j}\right|. (2.16)
While Mperf is the sum of the slowdown differences between every two co-scheduled
workloads, Mmiss is the sum of the differences in miss increase ratio between every two
co-scheduled workloads. The smaller Mperf and Mmiss are, the better performance fairness
and cache fairness are achieved.
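Both metrics share the same pairwise form and can be sketched directly from Equations (2.15) and (2.16) (an illustrative sketch; the function and parameter names are ours):

```python
from itertools import combinations

def m_perf(cpi_shared, cpi_dedicated):
    """Equation (2.15): sum over all workload pairs of the absolute
    difference in slowdown CPI_shr / CPI_ded; zero is perfectly fair."""
    slow = [s / d for s, d in zip(cpi_shared, cpi_dedicated)]
    return sum(abs(slow[i] - slow[j])
               for i, j in combinations(range(len(slow)), 2))

def m_miss(mpki_shared, mpki_dedicated):
    """Equation (2.16): the same pairwise form applied to the miss
    increase ratio MPKI_shr / MPKI_ded."""
    ratio = [s / d for s, d in zip(mpki_shared, mpki_dedicated)]
    return sum(abs(ratio[i] - ratio[j])
               for i, j in combinations(range(len(ratio)), 2))
```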
Considering the correlation between cache miss fairness and performance fairness discussed
in Section 2.4.1, Zhou et al. [Zhou et al., 2009b] provide an innovative model to quantify the
performance impact of cache sharing. Firstly, the execution cycles of the application are
categorized into two classes: private operation cycles (Tpri) and vulnerable operation cycles
(Tvul). Private operation cycles are execution cycles that depend only on the characteristics
of the workload; that is, no external factors such as shared cache latency or memory access
latency are involved during these cycles. These cycles are consumed by operations that need the
private resources of the core, so they depend on private resources and the related latencies, such
as computation time and instruction and data fetch latencies. Vulnerable operation cycles Tvul,
in contrast, are sensitive to external factors such as different co-scheduled threads on the other
cores, off-chip memory access latency and other off-chip latencies. However, especially in
out-of-order processors, even when an instruction is stalled by an L2 cache miss, other
instructions can still be fetched into the pipeline if there is no data dependence. Hence, during
these cycles, off-chip memory access and on-chip instructions are actually performed
simultaneously; in other words, they overlap. These overlapped cycles are called
overlap cycles (Tovl). As a result, total cycles T is defined as
T = T_{pri} + T_{vul} - T_{ovl}, (2.17)
where T_{vul} is equal to
T_{vul} = \sum_{i=1}^{N_{miss}} MissPenalty_i = \frac{N_{miss} \times MissLatency}{MLP_{avg}}. (2.18)
Then Equation (2.17) can be written as
T = T_{pri} - T_{ovl} + \frac{N_{miss} \times MissLatency}{MLP_{avg}}, (2.19)
where MLP_{avg} refers to the average number of useful outstanding off-chip requests.
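The resulting cycle model is a one-line formula (a sketch of Equation (2.19); the argument names are ours):

```python
def total_cycles(t_pri, t_ovl, n_miss, miss_latency, mlp_avg):
    """Equation (2.19): T = T_pri - T_ovl + N_miss * MissLatency / MLP_avg;
    higher memory-level parallelism shrinks the effective miss penalty."""
    return t_pri - t_ovl + n_miss * miss_latency / mlp_avg
```

For example, 1000 private cycles, 100 overlapped cycles and 50 misses of 200-cycle latency with an average MLP of 2 give 5900 total cycles instead of the naive 10900.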
Using the fairness metrics and cache performance impact model, Zhou et al. [Zhou et al.,
2009b] build a hardware framework which enforces the fairness metrics. The hardware platform
includes two interdependent parts: a cache partitioning mechanism and a hardware profiler. The
cache partitioning mechanism tags each cache block to indicate which core uses the block; for
N cores, the tag is log2 N bits long. In addition, each core i has a single-bit flag overAlloc
indicating whether it has been allocated too much cache space. This mechanism ensures that
over-allocated cores cannot gain additional cache lines, so the additional cache blocks are used
by under-allocated cores; in other words, cache fairness is enforced.
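The mechanism can be illustrated with a small sketch (our own simplification of the hardware: the owner tags and overAlloc flags are plain Python lists, and the fallback victim merely stands in for the normal replacement policy's choice):

```python
import math

def tag_bits(n_cores):
    """Each cache block carries a log2(N)-bit owner tag for N cores."""
    return int(math.ceil(math.log2(n_cores)))

def choose_victim(block_owners, requester, over_alloc):
    """On a miss by an over-allocated core, evict one of that core's own
    blocks so it cannot grow its allocation; otherwise fall back to
    index 0, standing in for the replacement policy's normal victim."""
    if over_alloc[requester]:
        for idx, owner in enumerate(block_owners):
            if owner == requester:
                return idx
    return 0
```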
The hardware profiler addresses performance fairness in a shared cache environment by
enforcing an identical slowdown, relative to a dedicated cache environment, for every workload.
In this context, the hardware profiler is responsible for profiling the necessary runtime and
statistical parameters while the applications execute concurrently, and it uses Equations (2.17)
and (2.19) to estimate how each application would perform with a dedicated cache. After
collecting these parameters, the fairness metrics Mperf and Mmiss, given in Equations (2.15)
and (2.16) respectively, are used to evaluate the fairness of the cores. Finally, the core with the
least slowdown is deemed to have been allocated too much cache space, and its overAlloc flag
is set. Although Zhou et al. provide a complete solution that guarantees both performance and
cache fairness on CMP systems, it comes at a comparably significant hardware cost: an
Auxiliary Tag Directory (ATD), which has the same associativity as the main tag directory of
the shared L2 cache, is used to
estimate some cache parameters for the hardware profiler; this effectively attaches a virtual
"private" cache to each core. In addition, an auxiliary Miss Status Holding Register (MSHR) is
added to each ATD.
Ebrahimi et al. [Ebrahimi et al., 2010] propose a new approach that enforces fairness over
the shared memory system as a whole, rather than a separate fairness mechanism for each
individual resource. In this way, the need for complex fairness mechanisms that independently
implement fairness for each resource is eliminated. Moreover, approaches implementing
separate fairness mechanisms can make contradictory decisions, leading to low fairness and
loss of performance. Ebrahimi et al. [Ebrahimi et al., 2010] develop a low-cost architectural
technique that allows system-level fairness policies, also called source-based fairness control,
to be applied to the entire memory system in CMPs. In order to achieve this, Ebrahimi et
al. [Ebrahimi et al., 2010] propose a two-step mechanism:
1. In the first step, dynamic feedback information about unfairness in the system is collected;
at this stage, system fairness is continuously monitored.
2. In the second step, this information is used to dynamically alter the rate at which the
different cores make resource requests to the shared memory subsystem, such that system-
level objectives are met.
To calculate the unfairness in the system, an execution time slowdown metric is defined as
$T_{shared}/T_{alone}$, where T_{shared} is the number of cycles an application takes when running
concurrently with other applications, and T_{alone} is the number of cycles it would have taken
running alone. However, it is challenging to obtain T_{alone} while the application is running
simultaneously with others. Hence, Ebrahimi et al. [Ebrahimi et al., 2010] define T_{excess} as
the number of extra cycles an application needs to execute due to inter-core interference in the
shared system, and T_{alone} is estimated as $T_{alone} = T_{shared} - T_{excess}$ [Ebrahimi et al., 2010].
In order to estimate T_{excess}, it is necessary to keep track of the inter-core interference each
core incurs; for this purpose, each core has an InterferencePerCore bit vector indicating whether
or not the core is delayed due to inter-core interference. When interference is detected by the
cache miss counter or the memory controller, InterferencePerCore is set and the T_{excess}
counter of the affected core is incremented.
If the unfairness is larger than the unfairness threshold set by the system software, the core
with the largest slowdown is throttled down. In order to throttle down memory requests from a
core, Miss Status Holding/Information Registers (MSHRs) can be used: increasing or decreasing
the number of MSHR entries for a core proportionally affects the rate at which that core injects
memory requests into the shared memory subsystem [Ebrahimi et al., 2010]. If continuous
fairness monitoring indicates that the system software goal has been met again, all cores are
allowed to throttle up, by controlling their MSHRs, to improve system performance.
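The estimation and throttling decision can be sketched as follows (a simplified reading: per-core slowdowns are computed from T_shared and T_excess and, as one plausible unfairness measure, the largest and smallest slowdowns are compared; the actual hardware policy in [Ebrahimi et al., 2010] may differ in detail):

```python
def estimate_t_alone(t_shared, t_excess):
    """T_alone = T_shared - T_excess, estimated while co-running."""
    return t_shared - t_excess

def throttle_decision(t_shared, t_excess, unfairness_threshold):
    """Compute each core's slowdown T_shared / T_alone; if the ratio of
    the largest to the smallest slowdown exceeds the threshold set by
    system software, return the index of the core to throttle down,
    otherwise None (all cores may throttle back up)."""
    slowdowns = [ts / estimate_t_alone(ts, te)
                 for ts, te in zip(t_shared, t_excess)]
    unfairness = max(slowdowns) / min(slowdowns)
    if unfairness > unfairness_threshold:
        return slowdowns.index(max(slowdowns))
    return None
```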
Ebrahimi et al. [Ebrahimi et al., 2010] evaluate their technique on both 2-core and 4-core
CMP systems, and the experimental results indicate system throughput improvements of 25.6%
to 14.5% and system unfairness reductions of 44.4% to 36.2%, achieved at a low cost. This
paper is significant for our research since it is one of the recent papers in the field, and it
provides a simple and efficient solution at the system level.
Jahre and Natvig [Jahre and Natvig, 2009] propose a novel fairness mechanism, called the
Dynamic Miss Handling Architecture (DMHA). The DMHA uses a single fairness enforcement
policy for the complete hardware-managed shared memory, which reduces the implementation
complexity significantly. DMHA exploits the Miss Status Holding Registers (MSHRs) available
in the private Level 1 (L1) caches, which determine the number of misses a cache can sustain
without blocking. When its MSHRs are exhausted, the cache blocks and the processor reduces
its execution speed, since it is unable to fetch more data. Jahre and Natvig use this fact to set
the execution speeds of the processors such that the slowdown due to memory system
interference is equalized across threads. In fact, Jahre and Natvig [Jahre and Natvig, 2009]
describe the whole approach, the Fair Adaptive Miss Handling Architecture (FAMHA), as
consisting of three steps: measurement, allocation and enforcement. For the measurement of
interference in a shared resource environment, Jahre and Natvig [Jahre and Natvig, 2009]
propose the notion of Interference Points (IPs), which consider three types of shared resource
interference: crossbar delay, shared cache contention and memory bus interference. From the
metrics for each interference type, total interference points are calculated. Then, a simple
fairness policy decides whether or not any reduction in execution speed is necessary. If the
fairness policy decides that processors must cut back their execution rate, the slowdown is
equalized across the processors. For this purpose, DMHA divides the total miss bandwidth
between all cores at the boundary of the private memory system, the private L1 caches. That is,
the miss bandwidth per thread is allocated by controlling the number of
available MSHRs at runtime. Hence, reducing the miss bandwidth, that is, the number of
MSHRs, slows down the core's execution speed. To deactivate an MSHR, an extra bit field,
the usable bit (U), is added to the conventional MSHR fields: the block address of the miss in
the private L1 cache, some target information and a valid bit (V). If the usable bit (U) is set,
the MSHR is deactivated and can only be used to store miss data. This reduces the number of
shared memory access requests and, as a result, forces the processor to reduce its execution
speed.
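The usable-bit mechanism can be sketched as a tiny register-file model (our own abstraction; real MSHRs also hold the miss block address, target information and a valid bit, which are omitted here):

```python
class MSHRFile:
    """Sketch of a DMHA-style miss handling register file: each entry
    carries a usable bit (U); when U is set the entry is deactivated
    and may only hold data for misses already outstanding."""

    def __init__(self, n_entries):
        self.u_bit = [False] * n_entries  # False = active entry

    def set_quota(self, n_active):
        """Allocate miss bandwidth by deactivating all but n_active
        entries; fewer active MSHRs means the cache blocks sooner and
        the core's execution speed drops."""
        for i in range(len(self.u_bit)):
            self.u_bit[i] = i >= n_active

    def can_accept_miss(self):
        """A new outstanding miss needs at least one active entry."""
        return any(not u for u in self.u_bit)
```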
Using this light-weight fairness mechanism on chip-level multiprocessor memory systems,
Jahre and Natvig improve fairness by 26% on average against the single program baseline, and
throughput by 13% on average when compared to conventional memory systems. This paper
introduces a simple and reasonable approach to the existing scheduling problem. In addition, it
provides a detailed insight into Miss Handling Architectures (MHA) 1 and MSHRs
[Ebrahimi et al., 2010].
All in all, there is a variety of approaches to the resource allocation fairness problem on
CMPs. The general strategy can be formulated in three stages: (1) measurement, (2) allocation
and (3) fairness enforcement. Algorithms differ mainly in their strategies for one of these three
sub-stages. For instance, while Jahre and Natvig use Interference Points (IPs) [Jahre and Natvig,
2009] to measure unfairness, Fedorova et al. [Fedorova et al., 2006] identify fairness based on
CPU latency or CPI; however, the general inclination is to use an execution time metric
[Ebrahimi et al., 2010, Kim et al., 2004]. Regarding fairness enforcement strategies, there are
also different approaches. For instance, Fedorova et al. [Fedorova et al., 2006] enforce fairness
by adapting the CPU cycles of a thread, whereas the general trend is to enforce fairness using
cache-related parameters such as allocation and partitioning [Ebrahimi et al., 2010, Jahre and
Natvig, 2009, Kim et al., 2004, Zhou et al., 2009b].
2.4.2 Adaptive Cache-Aware CMP Scheduling
The existing cache-aware multi-core scheduling algorithms are generally static open-loop
algorithms which cannot adapt dynamically to changing processor operating states such as
cache allocation. Hence, the efficiency of these algorithms is validated only in specific
operating environments, and their performance degrades to the extent that the
1 For more information, refer to the Kroft article [Kroft, 1981].
system state or parameters move away from the ideal operating point. As a result, there has
been a need for closed-loop algorithms to replace traditional open-loop algorithms in operating
environments where system states and parameters are subject to significant variations. In the
context of multi-core thread scheduling, shared L2 cache behavior can likewise be identified as
highly unpredictable and uncertain due to the highly correlated nature of co-runner threads.
The simulation results of Tam et al. [Tam et al., 2009] on a number of applications indicate that
cache miss rate curves may differ significantly between applications or threads, so it is very
hard to predict a thread's data reference pattern as well as its cache miss rate. Hence, the cache
access pattern itself introduces significant variation into the operating platform.
In such an unpredictable operating environment, closed-loop (feedback) control theory
ensures that the system adapts to these state changes at each control period. This is achieved by
setting a reference operating state and regulating the error between the reference and the
measured operating state towards zero. In summary, control theory provides optimality
guarantees and precise reasoning, that is, a way of quantifying, in terms of system parameters,
how far the system is from the optimization objective. Moreover, control theory assures a quick
adaptive response to changes in the dynamic behavior of system parameters, as in the case of
cache miss parameters. In spite of these advantages, there are some challenges in implementing
control theory. First of all, formulating the system behavior is not an easy task, especially for
complex systems. Secondly, the stability of the closed-loop system is a major concern, because
open-loop stability does not ensure overall closed-loop stability.
Srikantaiah et al. [Srikantaiah et al., 2009] apply formal feedback control theory to dynamic
shared last-level cache partitioning on CMPs. Feedback control theory is used to optimize
last-level cache space utilization among multiple concurrent applications with well-defined
service objectives. That is, an adaptive feedback control based cache partitioning scheme is
applied to achieve service differentiation among applications with minimum impact on the fair
speedup metric, while ensuring the fair speedup of the applications. Srikantaiah et al.
[Srikantaiah et al., 2009] underscore the necessity of QoS at the application level, which
addresses the lack of control over the management of shared resources, and offer cache
partitioning as an effective service level agreement method in shared-resource environments.
One of the important global performance objectives in a shared resource environment, the best
utilization objective, is formulated as a high
fair speedup metric for each workload:
FS(scheme) = \frac{N}{\sum_{i=1}^{N} \frac{IPC_{app_i}(base)}{IPC_{app_i}(scheme)}}, (2.20)

where N is the number of applications in the workload, that is, the set of applications running
concurrently, and the ratio $\frac{IPC_{app_i}(base)}{IPC_{app_i}(scheme)}$ reflects the improvement gained by application i
under the fair speedup improvement scheme. In this sense, FS is actually an indicator of
throughput as well as of fairness.
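Equation (2.20) is the harmonic-mean form of the per-application speedups and is straightforward to compute (a sketch; the names are ours):

```python
def fair_speedup(ipc_base, ipc_scheme):
    """Equation (2.20): FS = N / sum_i (IPC_base_i / IPC_scheme_i), the
    harmonic mean of the per-application speedups under the scheme."""
    n = len(ipc_base)
    return n / sum(b / s for b, s in zip(ipc_base, ipc_scheme))
```

Because it is a harmonic mean, a single badly slowed application drags FS down, which is why FS captures fairness as well as throughput.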
In this context, Srikantaiah et al. [Srikantaiah et al., 2009] design the SHAred Resource
Partitioning (SHARP) control architecture, which considers both fair speedup improvement and
service differentiation among applications as objectives. The SHARP control architecture
includes three main parts, as shown in Figure 2.4:
Per-Application Controller A Per-Application Controller specifies the cache space allocation
necessary to meet the specified performance targets.
Pre-Actuation Negotiator A Pre-Actuation Negotiator handles the situations where the total
cache demanded by the Per-Application Controllers exceeds the available cache capacity.
SHARP Controller A SHARP Controller adapts the reference performance targets of the
applications in a fair manner with respect to the total utilization of the cache, so that all
applications are served fairly under any cache utilization scenario.
In this architecture, the shared last-level cache is partitioned into a number of cache partitions
according to the number of applications running on the system, and the partition sizes (numbers
of cache ways) dedicated to the applications $(App_1, \cdots, App_N)$ are referred to as
$w^*_1, \cdots, w^*_N$. The total number of cache ways demanded by the Application Controllers
(AppC) should be less than or equal to the total available cache ways, $\sum_{i=1}^{N} w_i \le W$.
While an Application Controller (AppC) determines the ideal resource share as a number of
ways $(w_i)$, the Pre-Actuation Negotiator (PAN) ensures that the total requested cache ways do
not exceed the available cache ways W. If $\sum_{i=1}^{N} w_i > W$, the PAN adapts the cache
ways $(w_i)$ requested by the AppCs into feasible cache ways $w^*_i$. Based on the allocated
cache ways $w^*_i$, the i-th application is executed; and at the end of the CPU cycle, its
performance metric (IPC) $P^{out}_i$ is monitored and fed into the SHARP controller. Using the
performance metric $P^{out}_i$ and the performance reference $P^{ref}_i$, the SHARP
Figure 2.4: SHARP Control Architecture [Srikantaiah et al., 2009]
controller sets a new performance target $P^*_i$ for the Application Controller $(AppC_i)$.
The Application Controller (AppC) is the main part of the control framework; it determines
the number of cache ways $(w_i)$ required to achieve the target performance $P^*_i$ using
additional information: the performance achieved in the previous time interval, $P^{out}_i(t-1)$,
and the partition size $w_i(t-1)$. Srikantaiah et al. [Srikantaiah et al., 2009] propose a
customized AppC controller based on a Reinforced Oscillation Resistant (ROR) controller 2,
which tracks the history of cache decisions in order to alleviate the variation in cache space
allocations caused by traditional PID controllers [Srikantaiah et al., 2009]. In addition, relevant
cache allocation decisions require an efficient cache performance model. Hence, Srikantaiah et
al. [Srikantaiah et al., 2009] formulate the control law and the cache performance model as:
\Psi_i(t) = \frac{-\log\left(1 - \frac{P^*_i(t)}{\varphi_i(\infty)}\right)}{\alpha_i}, (Cache Performance Model) (2.21)

\forall i,\; w_i(t) = w_i(t-1) + \Psi_i(t) - m_i(t-1), (Control Law for the AppC (ROR) controller) (2.22)
where $\varphi_i(\infty)$ is the instructions per cycle (IPC) of application i when it is allocated the maximum
2 The ROR controller uses a similar design to a predictive controller; please refer to the Camacho and Bordons book on MPC [Camacho and Bordons, 1999].
number of cache ways (theoretically, $w_i = \infty$); $\Psi_i(t)$ is the predicted number of cache
ways application i needs in order to reach the target performance $P^*_i(t)$; the moving average
$m_i(t-1)$ summarizes the previous values of $w_i$ determined with the same control law; and
$\alpha_i$ is a parameter that approximately determines the utility of each additional cache way
that might be allocated to application i. Here, $\alpha_i$ and $\varphi_i(\infty)$ are cache model
parameters, which are learned online and updated with continuous allocations according to the
application's data access behavior.
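Equations (2.21) and (2.22) can be sketched directly (our reading: the model inverts an exponential IPC-versus-ways curve $IPC(w) = \varphi_\infty(1 - e^{-\alpha w})$ and takes log as the natural logarithm; both assumptions are ours, not stated explicitly in [Srikantaiah et al., 2009]):

```python
import math

def predicted_ways(p_target, phi_inf, alpha):
    """Cache performance model (2.21): Psi = -ln(1 - P*/phi_inf) / alpha,
    i.e. the ways needed to reach the target IPC if the IPC-versus-ways
    curve is IPC(w) = phi_inf * (1 - exp(-alpha * w))."""
    return -math.log(1.0 - p_target / phi_inf) / alpha

def next_ways(w_prev, psi, m_prev):
    """ROR control law (2.22): w(t) = w(t-1) + Psi(t) - m(t-1), where m
    is a moving average of past allocations that damps oscillation."""
    return w_prev + psi - m_prev
```

Inverting the model at the IPC produced by 4 ways returns 4 ways again, which is the self-consistency the controller relies on.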
The Pre-Actuation Negotiator (PAN) is used to obtain a feasible partition
$\{w^*_1, w^*_2, w^*_3, \cdots, w^*_N\}$ of the W-way partitioned shared cache; the PAN derives a
feasible partition from the demands $\{w_1, w_2, w_3, \cdots, w_N\}$ of the AppC controllers
whenever $\sum_{i=1}^{N} w_i > W$. The PAN can also implement different policies in this
process to enforce fairness and service differentiation. Srikantaiah et al. [Srikantaiah et al.,
2009] apply two sample policies in their paper:
Fair Speedup Improvement (FSI) This policy keeps the fair speedup metric of the workload
as high as possible. For that purpose, the total number of excess ways to recover from
the workload is determined in order to specify a feasible partition, $spill_w = \sum_{i=1}^{N} w_i - W$.
Then, for each application, a feasible number of ways $w^*_i$ is defined proportionally to
the demanded number of ways $w_i$:

w^*_i = \left\lfloor w_i\left(1 - \frac{spill_w}{\sum_{i=1}^{N} w_i}\right)\right\rfloor, (2.23)
Service Differentiation (SD) The service differentiation policy (SD) provides service
differentiation among applications when the cache space demand exceeds the available
cache space, in such a way that higher priority applications are favored during the
recovery of the $spill_w$ cache ways.
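The FSI policy of Equation (2.23) amounts to a proportional shrink of every demand (a sketch; the names are ours):

```python
import math

def fsi_partition(demanded, total_ways):
    """FSI policy (Equation 2.23): when the total demand exceeds W,
    shrink every application's demanded ways proportionally and floor
    the result; otherwise the demands are already feasible."""
    total_demand = sum(demanded)
    if total_demand <= total_ways:
        return list(demanded)
    spill = total_demand - total_ways
    return [math.floor(w * (1 - spill / total_demand)) for w in demanded]
```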
The SHARP controller aims to increase the utilization of the cache when the cache ways
demanded to meet the target IPCs are fewer than the total available cache ways. Therefore, the
reference IPCs $\{P^*_1, P^*_2, P^*_3, \cdots, P^*_N\}$ for the AppC controllers are recomputed in a
fair way if FSI is used in the PAN, such that the additional $W - \sum_{i=1}^{N} w_i$ cache ways
are distributed fairly as:

P^*_i(t) = P^{ref}_i \frac{W}{\sum_{j=1}^{N}\left(\frac{w^*_j(t-1)}{P^{out}_j(t-1)} P^{ref}_j\right)}, (2.24)
If the SD policy is used in the PAN, then the weight factors $\{\Theta_1, \Theta_2, \cdots, \Theta_N\}$
are also considered during the distribution of cache ways:

P^*_i(t) = P^{ref}_i \frac{W \sum_{j=1}^{N}\Theta_j}{\sum_{j=1}^{N}\left(\frac{w^*_j(t-1)}{P^{out}_j(t-1)} P^{ref}_j \Theta_j\right)}, (2.25)
In summary, the SHARP control architecture guarantees optimal cache space utilization, fair
speedup improvement, and service differentiation among concurrently running applications in a
shared last-level cache environment on CMPs. According to Srikantaiah et al., the fair speedup
scheme achieves a 21.9% improvement on the fair speedup metric and provides well regulated
service differentiation on 2-core and 8-core systems. This paper is significant for our research
work, especially from an architectural and cache performance modeling point of view.
There is another set of articles on feedback control theory, from Brinkschulte and Pacher
[Brinkschulte and Pacher, 2008] and Velusamy et al. [Velusamy et al., 2002]. Brinkschulte
and Pacher [Brinkschulte and Pacher, 2008] propose a closed feedback loop to control the
throughput (IPC rate) of a thread running on simultaneous multi-threaded microprocessors.
Thread synchronization and branch misprediction are considered the factors that cause the
throughput (IPC) of threads to fall behind the ideal aimed throughput. In such highly
unpredictable environments, Brinkschulte and Pacher propose a closed-loop feedback
framework considering only branch misprediction as a slowdown factor for the throughput of a
thread. First of all, a mathematical model is developed to define the relation between IPC and
branch misprediction:
IPC(n) = \frac{GP(n)}{1 + MPR(n) \cdot MPP}, (2.26)

where GP(n) is the thread's Guaranteed Percentage rate in interval n, MPR(n) is the
misprediction rate in interval n, and MPP is the number of penalty clock cycles. Then, based on
the mathematical model in Equation (2.26), a new thread priority is defined for the thread in
each control cycle. That is, the thread's actual IPC rate is monitored periodically, and the
difference between the aimed IPC rate and the measured value from the previous interval is fed
into a P controller in order to determine a new priority (the Guaranteed Percentage value, GP)
for the thread:

GP(n) = P \cdot (IPC_{Aimed} - IPC(n-1)), (2.27)
where P is the proportional factor used by the controller to derive a new GP value. Then,
Brinkschulte and Pacher analyze and discuss the convergence, settling time and steady-state
value of the controlled IPC rate. This paper provides us with a perspective on how to apply
control theory on a simultaneous multithreading (SMT) multiprocessor platform.
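The control loop of Equations (2.26) and (2.27) is simple enough to sketch directly (the function and argument names are ours):

```python
def ipc(gp, mpr, mpp):
    """Equation (2.26): IPC(n) = GP(n) / (1 + MPR(n) * MPP); the
    achieved IPC shrinks as the misprediction penalty grows."""
    return gp / (1.0 + mpr * mpp)

def next_gp(ipc_aimed, ipc_measured, p_gain):
    """Equation (2.27): the P controller derives the next Guaranteed
    Percentage value from the current IPC error."""
    return p_gain * (ipc_aimed - ipc_measured)
```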
Velusamy et al. [Velusamy et al., 2002] provide a general guideline on applying formal
controller-design techniques to track a desired adaptive behavior. This covers modeling the
underlying system behavior, choosing a setpoint (target metric) and a sampling rate, as well as
implementing controllers in hardware. Velusamy et al. demonstrate this guideline with a
concrete example, for which the cache decay interval is chosen. Velusamy et al. [Velusamy et
al., 2002] define cache decay as a technique for leakage-energy savings that waits for some
pre-determined time, the decay interval, before concluding that a cache line's data is no longer
in use and that the line can be deactivated [Velusamy et al., 2002] 3. Most importantly, the
paper of Velusamy et al. is a good roadmap for applying control theory to a computer
architecture. The paper includes a basic-level discussion on mapping adaptive techniques onto
control loops, covering control loop parameters such as the setpoint, controller parameters and
control input, and simple digital control theory concepts such as the proportional controller
(P controller), the integral controller (I controller), controller gain selection and stability
analysis. In addition, Velusamy et al. also enlighten the reader on establishing dynamic models,
selecting sampling rates and building practical systems. In our research, this paper will be a
good roadmap for mapping our control strategy onto our scheduling strategy and scheduling
framework.
In summary, to the best of our knowledge, the control theory approach is not a common
strategy for shared computing resource scheduling problems, and even the works discussed so
far apply only basic control tools to the cache-aware chip-level multiprocessor scheduling
domain. These tools provide significant efficiency when the system dynamics and operating
conditions vary only slightly around an operating point. However, in a highly unpredictable
operating environment where the system operating conditions, states and dynamics vary
significantly, simple feedback control algorithms fail. Modern control theories, in contrast,
remain applicable to such scenarios and can handle the unpredictability of the system and of
the operating environment.
3 Since leakage-energy saving in multiprocessor systems is outside our research scope, further detailed discussion is skipped; please refer to Velusamy et al. for a more detailed discussion [Velusamy et al., 2002].
2.5. MODERN CONTROL THEORY FOR SCHEDULING PROBLEMS 47
2.5 Modern Control Theory for Scheduling Problems
In parallel with increasing system complexity, control system theory develops new
algorithms, methods and even theories to tackle system challenges; as a result, modern control
systems have emerged. Paraskevopoulos [Paraskevopoulos, 2002] explains the main differences
between traditional and modern (advanced) control systems:
1. Classical control refers to single input single output (SISO) systems; hence its design
methods prefer graphical techniques such as Bode, Nyquist and root locus plots rather
than advanced mathematical design approaches;
2. Modern control refers to complex multiple input multiple output (MIMO) systems, and
its design methods are analytical methods which require advanced mathematics.
Modern control theory, which is based on time-domain analysis capabilities and state variables,
has been developed in order to handle the complexity of modern plants and to meet strict
requirements on accuracy and cost [Ogata, 1997]. From 1960 to 1980, research was conducted
on optimal control for both deterministic and stochastic systems and on adaptive control of
complex systems. Optimal control theory refers to the mathematical optimization technique for
deriving control policies, whereas adaptive control refers to the modern control scheme capable
of handling time-varying system behaviors. According to Ioannou and Fidan [Ioannou and
Fidan, 2006], adaptive control consists of a parameter estimator, which estimates the plant
parameters online, and a control law to control sets of plants whose parameters are completely
unknown and/or behave unpredictably. On the basis of the Ioannou and Fidan definition, the
adaptive control scheme is considered the main framework of our thesis project. In this context,
the rest of the literature review focuses on adaptive control.
2.5.1 Adaptive Control
The history of adaptive control systems goes back to the early 1950s, when adaptive control
theory was proposed in order to control systems with unknown or slowly time-varying
characteristics. The first application domain was the avionics industry, in the design of
autopilots for high-performance aircraft. Such aircraft have a wide range of operating speeds
and environments, and conventional feedback controllers could operate well in one operating
condition, but not over the whole flight. Hence, gain scheduling was the first proposed adaptive
control scheme, believed to be a suitable technique for flight control systems. In the 1960s,
developments in state-space, stability and stochastic theory led to a better understanding of
adaptive systems; building on this, Tsypkin proposed a common framework uniting system
identification (learning) with adaptive control. Following this, the major developments in
system identification and estimation schemes, as well as in different design methods, occurred
in the 1970s. In the late 1970s and early 1980s, proofs of the stability of adaptive systems led
to the idea of merging robust control and system identification, which in turn led to robust
adaptive control [Astrom and Wittenmark, 1994].
As explained above, adaptive control theory was developed to address the inefficiency of con-
ventional feedback systems in controlling time-varying or unknown systems. That is, sys-
tems whose plant is unknown, or whose plant parameters are subject to unpredictable
slow variations, are the application domain of adaptive control. For these systems, adaptive control
provides an adjustment mechanism to re-design the controller to meet the control objectives and
specifications, and this process is repeated on each time cycle. In this way, continuous
adaptation of the controller to the varying system characteristics is achieved. In such
control systems, it is necessary to have two separate loops, as in Figure 2.5: (1) a normal feedback
loop with the process and the controller, and (2) a parameter adjustment loop including online
learning and controller adjustment [Astrom and Wittenmark, 1994].

Figure 2.5: Block Diagram of an Adaptive System [Astrom and Wittenmark, 1994]

So far, the system plant parameter
variation has been presented as the decision threshold for whether or not an adaptive control
system is required. In fact, linear feedback systems also have the ability to cope
with parameter changes to some extent. However, without a bound or limit on the variation, the
stability of the linear feedback system becomes a critical issue. Ioannou and Fidan [Ioannou
and Fidan, 2006] explain the case where an adaptive control system is superior to linear control as follows:
Consider the scalar plant

ẋ = αx + u, (2.28)

where u is the control input, x is the state of the plant, and α is unknown. In such a system, the aim
is to choose u such that the state x remains bounded and stable over time. If α were a known parameter,
then the linear control law

u = −kx, k > |α|, (2.29)

would meet the control objective.
In fact, for a known upper bound ᾱ ≥ |α|, the above control law can also meet the control objective
with k > ᾱ. However, if α drifts into the range α > k > 0, then the closed loop ẋ = (α − k)x becomes
unstable. In conclusion, in the absence of an upper bound for the plant parameter, there is no
fixed linear controller that stabilizes the plant [Ioannou and Fidan, 2006]. As mentioned previously,
adaptive control systems employ two different loops, the adjustment loop and the traditional
feedback loop. Hence, it is necessary for the adaptive system to observe the system operating
parameters and, in some cases, to perform online learning or estimation of the plant parameters. This
requires enough data to obtain stable information about these parameters.
Afterwards, based on the online learning or observation, the controller parameters are adjusted. In
this regard, it is clear that the adaptive system's response to a changing operating parameter
is not instantaneous and takes a few cycles to complete. Hence, an adaptive control system
is able to track only slowly changing operating conditions or plant parameters.
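The scalar example above can be illustrated numerically. The sketch below (plain Python) compares a fixed-gain controller u = −kx with a simple gradient adaptation law k̇ = γx², a standard adaptive law from the literature; the adaptation gain γ, step size, and horizon are illustrative choices for the simulation, not values from the thesis.

```python
def simulate(alpha, adaptive, k0=1.0, gamma=5.0, dt=0.001, steps=20000):
    """Euler simulation of the scalar plant xdot = alpha*x + u with u = -k*x.

    Fixed gain keeps k = k0; the adaptive law grows k via kdot = gamma*x**2
    until the closed loop xdot = (alpha - k)x becomes stable.
    """
    x, k = 1.0, k0
    for _ in range(steps):
        x += dt * (alpha - k) * x          # closed-loop dynamics
        if adaptive:
            k += dt * gamma * x * x        # gradient adaptation law
        if abs(x) > 1e6:                   # treat blow-up as instability
            return float("inf")
    return abs(x)

# alpha = 2 exceeds the fixed gain k0 = 1, so the fixed controller diverges,
# while the adaptive gain rises above alpha and regulates x toward zero.
fixed_result = simulate(2.0, adaptive=False)
adaptive_result = simulate(2.0, adaptive=True)
```

Under these illustrative numbers the fixed-gain run diverges while the adaptive run drives |x| to essentially zero, matching the Ioannou and Fidan argument.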
As discussed previously, adaptive control systems are applicable to operating environ-
ments or plants with unbounded parameter variations and/or unknown parameters. In such cases,
the variations in real systems might arise from the following sources [Astrom and Wittenmark, 1994]:
Nonlinear Actuators Actuators, such as valves, are a very common source of variation and
have nonlinear characteristics.
Flow and Speed Variations Pipe and tank systems can be another source of variation in
the sense that flows are generated by the source, so the source production rate has a
significant impact on the flow. That is, the flow dynamics change as the produc-
tion rate (operating point) changes; hence, a controller should also be adapted to such
changes to achieve better performance.
Flight Control The flight dynamics of airplanes change significantly with speed, altitude and
so on. Specifically, supersonic flight control systems such as autopilots and stability aug-
mentation systems created a significant challenge for conventional linear feedback systems,
and this was the driving force in the development of adaptive systems.
Variations in Disturbance Characteristics Variation in disturbance characteristics can have
as significant an impact on controller performance as variation in the system dy-
namics. For instance, the design of an autopilot (control system) for ship steering must
consider the disturbance forces acting on the ship, such as wind, waves and current; and
these disturbance factors can vary significantly due to unpredictable changes in
weather conditions. Hence, it is mandatory to adapt the controller parameters to cope with
such significant changes in the disturbance characteristics of the system. The same applies
to some flight control systems.
According to their online estimation and parameter uncertainty compensation capabilities, adap-
tive control schemes can be categorized into three subcategories: identifier-based adaptive
schemes, which have online estimation capability but cannot compensate for parameter uncertain-
ties; non-identifier based schemes, which support neither online estimation nor uncertainty
compensation; and dual adaptive control, which facilitates both online estimation (as hyper-
state estimation) and uncertainty compensation.
Identifier-Based Adaptive Control Schemes
This class of adaptive control schemes combines the online estimator, which estimates the
unknown parameters of the plant at each instant of time, with the conventional control law. In
this class of schemes, the system plant is identified using online estimators at each control loop,
and based on the estimated plant model, the controller is designed using conventional controller
design methods for traditional feedback loops. The online estimation or system identification
process varies depending on the plant characteristics:
Nonparametric Adaptive Control This scheme is based on stochastic estimation algo-
rithms, since the plant has a stochastic characteristic rather than a determin-
istic one. In this case, the stochastic behavior of the plant is estimated.
Parametric Adaptive Control This class of scheme is based on deterministic estimation
algorithms, since the plant can be represented by parameters; in that case these pa-
rameters are estimated. This class of scheme can be approached in two ways:
1. Indirect (Explicit) Adaptive Control
2. Direct (Implicit) Adaptive Control
Further discussion of the parametric adaptive control schemes is provided in Chapter 3,
System Model Adaptive Control.
Non-Identifier Based Adaptive Control Schemes
The non-identifier based class of adaptive control schemes, in contrast to the identifier-based ones,
does not include online estimators. Instead, a search method is utilized in order to find appropri-
ate control parameters from the range of possible parameters; alternatively, switching between
different fixed-parameter controllers can be employed under the assumption that at least one of them is
stabilizing; or multiple fixed models of the plant can be used.
Dual Adaptive Control Systems
The adaptive schemes described so far do not consider parameter and plant un-
certainties. In this regard, dual adaptive control systems provide a solution for uncertain
adaptive systems. The dual adaptive control scheme derives a solution from an abstract formulation
of the control problem, as in the previous schemes, and uses optimization theory, specifically
nonlinear stochastic control theory, for the optimal solution. The major contribution and advan-
tage of dual adaptive schemes is that the uncertainties in the parameters are taken into account
by the controller; however, it is a complicated approach for practical problems.
2.5.2 Parameter Estimation
Accurate online estimation of the process parameters is significant in adaptive control
theory, particularly in identifier-based adaptive control schemes and dual adaptive control
systems. In this sense, parameter estimation in the general context is the key element in our
identifier-based adaptive scheduling scheme. According to Ljung [Ljung, 1998], parameter
estimation is a search, within a set of parametric models, for the parametric model that best fits
given experimental data according to a given criterion. Parameter
estimation methods can be classified according to the source of statistical information on which
model construction is based, namely prior knowledge and experiment data [Bohlin, 2006]:
White-Box Estimation White-box parameter estimation uses invariant prior knowledge of
the system and its environment. The shortcoming of this identification approach is its
inability to handle uncertainties and unpredictable changes in the system.
Black-Box Estimation Black-box estimation methods use statistical methods to pro-
cess experiment data and produce a data description of the system, which leads to the
selection of a suitable model. In this approach, the system uncertainties and changes
are explicitly considered as part of the usual process; however, since the model is
based entirely on the experiment data, the main concern is inconsistency in the data
itself, which might lead to very different models across the iterative experiments.
Grey-Box Estimation Grey-box estimation methods use both prior knowledge and experiment
data to construct the most suitable model.
Regardless of this classification, Astrom and Wittenmark [Astrom and Wittenmark, 1994]
underline the key elements of parameter estimation to be heeded from the designer's point of
view:
Selection of Model Structure Selection of model structure involves selecting a class
of models within which the most suitable model is chosen. Depending on the system
complexity, linearity or nonlinearity, time characteristics such as time variance and
time invariance, and the mathematical representation of the system (e.g., polynomial or
state-space representation), different model structures can be constructed. In our research
project, a linear time-invariant discrete state-space model is designed for the thread cache
behaviors.
Experiment Design Experiment design is the development of an experiment setup
in which the experiment variables are defined, including the experiment metrics (which
signals to measure), the observation retrieval period (when to measure these signals) and
the input signals (which input signals to apply to the system). The aim of experiment
design is to determine the experiment variables so as to retrieve maximal statistical
information. Two significant experiment types underlie experiment design in
parameter estimation: online experiments and offline estimation. In our experiment,
the online experiment is selected because our parameter estimation method is required
to be functional while the system is still running.
Estimation Phase The estimation phase is in fact the selection of the best model from a predefined
model set. In other words, the best mapping of the input data vector into the output data vector
with minimal error is sought, and the set of parameters which maps each component
of the input to the output is called the parameter vector θ. In our research, the least-squares
family of estimation algorithms is used as the estimation tool.
Validation The validation stage can be considered as the mechanism which validates the
selection of the best model out of the model set.
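The estimation phase above can be illustrated with a minimal batch least-squares sketch. The first-order input-output model, its parameter values, and the data-generation setup below are hypothetical choices for illustration, not the thesis model; the thesis uses the least-squares family on the cache model of Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-order input-output model: y[k] = t1*y[k-1] + t2*u[k-1].
theta_true = np.array([0.8, 0.5])
u = rng.standard_normal(200)
y = np.zeros(201)
for k in range(1, 201):
    y[k] = theta_true[0] * y[k - 1] + theta_true[1] * u[k - 1]

# Stack the regressors into Phi and solve the least-squares problem
# min || y - Phi @ theta ||^2 for the parameter vector theta.
Phi = np.column_stack([y[:-1], u])
theta_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
```

With noise-free data the estimate recovers the generating parameters; validation would then compare the fitted model's predictions against held-out data.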
In summary, this section has provided an abstract introduction to parameter estimation in the
context of the adaptive control scheme. The key design steps in parameter estimation proce-
dures, including selection of models, experiment design, parameter estimation and validation,
were briefly introduced. Throughout this section, questions related to the adaptive control
context and the possible application domain were addressed to ensure the coherence of parameter
estimation with adaptive control theory.
2.5.3 Control System Design Algorithms
In the context of adaptive control, control system design refers to the mathematical approach of
designing a system with a specified behavior for given design parameters. As mentioned previ-
ously, adaptive control theory is based on the underlying assumption that the process dynamics
and environment change continually. In this case, it is not appropriate to consider a static
controller. That is, each time the recursive parameter estimation tools estimate the system parameters,
it is necessary to re-design the controller based on these estimates. Hence,
recursive controller design algorithms are needed, which are capable of determining the
controller parameters for given system parameters.
The control system design approaches in modern control theory can be categorized
under three main subcategories:
• Frequency Domain Control System Design Algorithms
• State-Space Control System Design Algorithms
• Robust and Optimal Control System Design Algorithms
In our adaptive scheduling framework, the state-space control system design algorithms are
used because of their compatibility with the adaptive control framework. State-space control design
methods can be divided into two different classes of methods [Paraskevopoulos, 2002]:
Algebraic Control Design Methods The algebraic method belongs to a particular category of mod-
ern control design problems where the controller has a pre-specified structure. In this case,
it is only required to determine the controller parameters for which certain closed-loop
performance requirements are met. These algebraic system design methods address many
control problems such as dead-beat, pole placement, input-output decoupling and exact
model matching design. In the context of adaptive control, the algebraic methods ad-
dressing pole placement problems are named adaptive pole placement methods [Astrom
and Wittenmark, 1994].
State Observer Control Design Methods State observer methods are used to generate a good
estimate x̂(t) of the state vector x(t) based on the mathematical model of the system;
hence, in this approach it is essential to have an exact mathematical model of the system.
Using these estimates x̂(t) and state feedback techniques, it is possible to solve many
control problems such as pole assignment, input-output decoupling and model matching.
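The state observer idea can be sketched with a minimal discrete-time Luenberger observer. The plant matrices and the observer gain L below are illustrative values (not from the thesis), with L chosen so that A − LC has stable eigenvalues and the estimation error decays.

```python
import numpy as np

# Hypothetical discrete-time plant; A, B, C and the observer gain L are
# illustrative, with L chosen so that A - L @ C is a stable matrix.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [0.1]])
C = np.array([[1.0, 0.0]])
L = np.array([[0.9],
              [0.5]])

x = np.array([[1.0], [-1.0]])   # true state (not directly measurable)
x_hat = np.zeros((2, 1))        # observer estimate, started with no knowledge
for k in range(200):
    u = np.array([[np.sin(0.05 * k)]])
    y = C @ x                                        # measured output
    # Luenberger update: model prediction corrected by the output error
    x_hat = A @ x_hat + B @ u + L @ (y - C @ x_hat)
    x = A @ x + B @ u
estimation_error = np.linalg.norm(x - x_hat)
```

The error obeys e(k+1) = (A − LC)e(k), so with a stable A − LC the estimate converges to the true state regardless of the observer's initial guess.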
In the context of adaptive control, there is no exact mathematical system model; instead,
the parameters are estimated on each cycle based on the input-output relation of the system. In
this case, the algebraic control design method is much more convenient for our adaptive control
scheme. As mentioned previously, the rest of the discussion refers to the algebraic design methods
as adaptive pole placement algorithms. The algebraic control design methods are discussed in
detail in Chapter 5, Algebraic Controller Design Methods.
In summary, as mentioned previously, different design methodologies can be utilized in
control system design; however, state-space algebraic design methods, specifically adaptive
pole placement methods, are very common in adaptive control schemes. Therefore, our
identifier-based indirect adaptive control scheme uses the adaptive pole placement method to
design the controller. As future work, this control system design could be improved
with more advanced control design approaches, such as H∞ robust control design methods, to
increase the robustness of the overall adaptive control scheme.
Chapter 3
System Model Adaptive Control
In this chapter, the mathematical dynamic resource pattern model of a thread in a shared cache
and processor cycle environment is developed. This model, in fact, formulates the cache-aware
resource allocation problem in terms of the constraints cache miss count, instruction count and
CPU cycles. Ultimately, this formulation constitutes the basis of our cache aware adaptive
closed loop scheduling solution.
3.1 Theoretical Background of Dynamic System Model
In our thesis, the thread execution and cache access patterns of threads, particularly shared Level
2 (L2) cache behavior and the corresponding slowdown in processor performance, are modeled. As
might be expected, due to the number of parameters involved and the order of the system, it is
necessary to adopt a mathematical approach which is capable of analyzing high-order systems with
many parameters and which is convenient to implement in a computing environment.
As a result, state-space theories and tools are preferred. The rest of the chapter focuses on the
state-space formulation of these systems and the relevant adaptive strategies on these state-space
elements.
3.1.1 State-Space Theory
In the context of state-space theory, the state-space modeling and analysis of control
systems are discussed. Before proceeding further, some preliminary definitions of state-
space theory are as follows:
State The state of a dynamic system is the smallest set of variables (called state variables)
whose knowledge at t = t0, together with knowledge of the input for t ≥ t0, completely
determines the behavior of the system for any time t ≥ t0.
State Variables The state variables of a dynamic system are the variables forming the smallest
set of variables that determine the state of the dynamic system. For instance, if at least
n variables x1, x2, · · · , xn are required to describe the behavior of a dynamic system,
then those n variables are a set of state variables. Most importantly, state-space theory
provides significant freedom in selecting state variables: state variables need not
be physically measurable or observable quantities. However, in practice it
is convenient to choose measurable quantities where possible.
State Vector If n state variables are required to completely describe the behavior of
a given system, these n state variables are the n components of a vector x, which is called the
state vector. In other words, a state vector is a vector that determines the system state x(t)
for any time t ≥ t0, once the state at t = t0 is given and the input u(t) for t ≥ t0 is
defined.
State Space The n-dimensional space ℝⁿ, which is formed by the n independent axes
x1, x2, x3, · · · , xn, is called the state space; any state can be represented by a point in the
state space.
State-Space Equations In state-space analysis, three types of variables are involved in modeling
dynamic systems: input variables, output variables, and state variables. According
to the definition of a state vector, a state vector is able to describe the system state for any time
t ≥ t0 if and only if the state at t = t0 along with the input u(t) for t ≥ t0 is provided.
For this to hold, the physical system must involve elements that memorize this information. For
instance, in continuous-time control systems the integrators serve as memory devices,
and the outputs of the integrators are considered as state variables. Hence, the number of
integrators equals the number of state variables that define the control system.
Assume a MIMO system involving n integrators has r inputs u1(t), u2(t), · · · , ur(t) and m
outputs y1(t), y2(t), · · · , ym(t); and the n outputs of the integrators are defined as state variables
x1(t), x2(t), · · · , xn(t). Then the system can be described by the following equations [Ogata,
1997]:
ẋ1(t) = f1(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
ẋ2(t) = f2(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
...
ẋn(t) = fn(x1, x2, · · · , xn; u1, u2, · · · , ur; t). (3.1)
The outputs y1(t), y2(t), · · · , ym(t) of the system are defined by
y1(t) = g1(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
y2(t) = g2(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
...
ym(t) = gm(x1, x2, · · · , xn; u1, u2, · · · , ur; t). (3.2)
These equations can be transformed into vector form by defining

x(t) = [x1(t), x2(t), · · · , xn(t)]ᵀ,
f(x, u, t) = [f1(x1, · · · , xn; u1, · · · , ur; t), · · · , fn(x1, · · · , xn; u1, · · · , ur; t)]ᵀ, (3.3)

y(t) = [y1(t), y2(t), · · · , ym(t)]ᵀ,
g(x, u, t) = [g1(x1, · · · , xn; u1, · · · , ur; t), · · · , gm(x1, · · · , xn; u1, · · · , ur; t)]ᵀ,
u(t) = [u1(t), u2(t), · · · , ur(t)]ᵀ. (3.4)
Then equations (3.1) and (3.2) can be written compactly as:

ẋ(t) = f(x, u, t), (3.5)
y(t) = g(x, u, t). (3.6)
If the vector functions f and/or g change with respect to time, then the system is called
a time-varying system.
If Equations (3.5) and (3.6) are linearized around an operating state, then the linearized state
and output equations are as follows:

ẋ(t) = A(t)x(t) + B(t)u(t), (3.7)
y(t) = C(t)x(t) + D(t)u(t). (3.8)

If the vector functions f and g do not explicitly involve the time parameter t, then the system is called a time-
invariant system. In this case, equations (3.7) and (3.8) can be simplified to:

ẋ(t) = Ax(t) + Bu(t), (3.9)
y(t) = Cx(t) + Du(t). (3.10)

These are the general forms of the state and output equations: for general time-
varying systems (3.5), (3.6), for linear time-varying systems (3.7), (3.8), and for linear time-
invariant systems (3.9), (3.10).
Transformation between the state-space representation in the time domain and the transfer function
representation in the Laplace/frequency domain is another commonly used derivation during
control design; hence, the following formulation indicates how to derive a MIMO transfer
function from given state-space equations. Consider a MIMO LTI system with the following
state-space representation:

Ẋ(t) = AX(t) + BU(t),
Y(t) = CX(t) + DU(t). (3.11)
Take the Laplace transform of the state-space equations:

L[Ẋ(t)] = L[AX(t)] + L[BU(t)],
L[Y(t)] = L[CX(t)] + L[DU(t)]. (3.12)

The Laplace transform gives the following result:

sX(s) − X(0) = AX(s) + BU(s),
Y(s) = CX(s) + DU(s). (3.13)

Separate out the variables in the state equation as follows:

sX(s) − AX(s) = X(0) + BU(s). (3.14)

Factor out X(s):

X(s) = [sI − A]⁻¹X(0) + [sI − A]⁻¹BU(s). (3.15)

Derive Y(s) using X(s), assuming zero initial conditions X(0) = 0:

Y(s) = (C[sI − A]⁻¹B + D)U(s),
H(s) = C[sI − A]⁻¹B + D. (3.16)
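Equation (3.16) can be evaluated numerically for a given system. The sketch below (numpy, with an illustrative single-state system, not one from the thesis) computes H(s) = C(sI − A)⁻¹B + D at a chosen complex frequency:

```python
import numpy as np

def transfer_function(A, B, C, D, s):
    """Evaluate H(s) = C (sI - A)^(-1) B + D at one complex frequency s."""
    n = A.shape[0]
    return C @ np.linalg.solve(s * np.eye(n) - A, B) + D

# Illustrative single-state system: xdot = -2x + u, y = x, so H(s) = 1/(s + 2).
A = np.array([[-2.0]])
B = np.array([[1.0]])
C = np.array([[1.0]])
D = np.array([[0.0]])

dc_gain = transfer_function(A, B, C, D, 0.0)   # H(0) = 1/2
```

Using `solve` instead of forming the inverse explicitly is the usual numerically preferable choice; for a MIMO system the same call returns the full transfer matrix at s.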
In our computing environment, discrete-time quantities are used; in that case, the continuous
time variable t is replaced with the discrete time variable k. Depending on the system type,
the state-space representation can take the different forms shown in Table 3.1.
In this regard, our thread execution pattern model is composed of two interconnected finite-
dimensional discrete linear time-varying deterministic systems; hence, our model is repre-
sented by the corresponding state-space model in Table 3.1.
Table 3.1: System Type vs State-Space Model

System Type                                   State-Space Model
Continuous time-invariant                     ẋ(t) = Ax(t) + Bu(t)
                                              y(t) = Cx(t) + Du(t)
Continuous time-variant                       ẋ(t) = A(t)x(t) + B(t)u(t)
                                              y(t) = C(t)x(t) + D(t)u(t)
Discrete time-invariant                       x(k + 1) = Ax(k) + Bu(k)
                                              y(k) = Cx(k) + Du(k)
Discrete time-variant                         x(k + 1) = A(k)x(k) + B(k)u(k)
                                              y(k) = C(k)x(k) + D(k)u(k)
Laplace domain of continuous time-invariant   sX(s) = AX(s) + BU(s)
                                              Y(s) = CX(s) + DU(s)
Z-domain of discrete time-invariant           zX(z) = AX(z) + BU(z)
                                              Y(z) = CX(z) + DU(z)
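The discrete time-invariant row of Table 3.1 can be iterated directly, which is how such a model is simulated in a computing environment. A minimal sketch with a hypothetical scalar system (the matrices below are illustrative, not the thesis model):

```python
import numpy as np

def step_discrete_lti(A, B, C, D, x0, u_seq):
    """Iterate x(k+1) = A x(k) + B u(k), y(k) = C x(k) + D u(k)."""
    x, outputs = x0, []
    for u in u_seq:
        outputs.append(C @ x + D @ u)   # output at step k
        x = A @ x + B @ u               # state advance to step k+1
    return x, outputs

# Hypothetical scalar system: x(k+1) = 0.5 x(k) + u(k), y(k) = x(k),
# driven by a constant unit input for four steps.
A = np.array([[0.5]]); B = np.array([[1.0]])
C = np.array([[1.0]]); D = np.array([[0.0]])
x_final, ys = step_discrete_lti(A, B, C, D, np.array([[0.0]]),
                                [np.array([[1.0]])] * 4)
```

The same loop works unchanged for the time-variant case by indexing A(k), B(k), C(k), D(k) per step.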
3.1.2 Adaptive Control Theory
As briefly explained in the literature review chapter, adaptive control addresses the inefficiency
of conventional feedback systems in controlling time-varying or unknown systems. More
specifically, systems whose plant is unknown, or whose plant parameters are subject
to unpredictable slow variations, fall within the adaptive control application domain.
Adaptive control provides an adjustment mechanism to iteratively re-design the controller to meet the
control objectives and specifications. As a result, continuous adaptation of the controller to the
varying system characteristics is achieved.
As previously discussed in the literature review (Chapter 2), adaptive self-tuning control schemes
can be categorized into three subcategories, identifier-based adaptive schemes, non-identifier
based schemes and dual adaptive control schemes, based on their online estimation and parameter
uncertainty compensation capabilities. Here, only the identifier-based parametric adaptive control
scheme, which is also referred to as the adaptive self-tuning control framework throughout the thesis,
is discussed. Please refer to Section 2.5.1 of the literature review for more details on the other schemes.
Identifier-Based Parametric Adaptive Control Scheme
This class of adaptive control schemes combines an online estimator with a conventional control
law. In this class of schemes, the system plant is identified using an online estimator at each control
loop, and based on the estimated plant model, the controller is designed using conventional
controller design methods.
A parametric adaptive control scheme is based on deterministic estimation algorithms rather
than on the stochastic ones of its nonparametric counterparts. In this scheme, the plant is repre-
sented by parameters, which are estimated using measured input/output data. This class of
scheme can be approached in two ways:
Indirect (Explicit) Adaptive Control In this approach, the plant parameters are estimated first,
and from these the controller parameters are derived. That is, at each
time instant t, an estimated plant is first formed, and this plant is used to derive the
controller parameters using conventional controller design methods. Figure
3.1 shows the structure of indirect adaptive control. In this figure, for a plant model
G(θ∗) where θ∗ is unknown, the online estimator generates an estimate θ(t) of the unknown θ∗.
The control system design module then generates the controller parameters θc(t)
for the corresponding estimated plant parameters θ(t), and the controller parameters are fed
into the controller C(θc). This whole process is repeated at each control cycle. In this
case, the main challenge in indirect adaptive control is to choose the class of control laws
C(θc), and the class of parameter estimators generating θc(t), so that the controller C(θc)
satisfies the performance requirements for the plant model G(θ∗) [Ioannou and Fidan,
2006]. Astrom and Wittenmark refer to this scheme as the indirect self-tuning regulator.
Figure 3.1: Indirect(Explicit) Adaptive Control [Ioannou and Fidan, 2006]
Direct (Implicit) Adaptive Control In the second approach, the plant model is parametrized
in terms of the desired controller parameters; then, without calculating plant pa-
rameter estimates, the desired controller parameters are estimated directly. Figure 3.2
demonstrates the structure of direct adaptive control. In direct adaptive control, the
plant model G(θ∗) is parametrized in terms of the unknown controller parameter θ∗c for which
C(θ∗c) satisfies the performance objectives [Ioannou and Fidan, 2006]. In this case, the
online estimator generates estimates θc(t) of θ∗c using the input and output of the
plant, and the estimated θc(t) is fed into the controller. Similar to the indirect scheme,
the main challenge in the direct adaptive control approach is the choice of control laws
C(θ∗c) and parameter estimators which ensure the plant meets the performance targets
or requirements. Moreover, this control scheme is applicable if and only if the plant can
be parametrized in terms of the unknown controller parameters.
Figure 3.2: Direct(Implicit) Adaptive Control [Ioannou and Fidan, 2006]
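The indirect scheme described above can be sketched as a minimal self-tuning loop: recursive least squares estimates the plant parameters at each cycle, and a pole-placement law recomputes the controller gain from those estimates. The scalar plant, its parameter values, the desired pole, and the RLS initialization below are all illustrative assumptions, not the thesis cache model.

```python
import numpy as np

# Hypothetical scalar plant y[k+1] = a*y[k] + b*u[k]; (a, b) unknown to the
# controller. Illustrative values; open loop is unstable since |a| > 1.
a_true, b_true = 1.2, 0.7
p_desired = 0.5                # desired closed-loop pole

theta = np.array([0.0, 1.0])   # initial estimate of (a, b)
P = 100.0 * np.eye(2)          # RLS covariance (large = low prior confidence)
y = 1.0
for k in range(50):
    # 1) Control design from the current estimates (pole placement):
    #    choose the feedback gain so that a_hat + b_hat*gain = p_desired.
    a_hat, b_hat = theta
    u = (p_desired - a_hat) / b_hat * y
    y_next = a_true * y + b_true * u           # plant response
    # 2) Recursive least-squares update of the plant parameter estimates.
    phi = np.array([y, u])
    gain = P @ phi / (1.0 + phi @ P @ phi)
    theta = theta + gain * (y_next - phi @ theta)
    P = P - np.outer(gain, phi @ P)
    y = y_next
```

With noise-free data the estimates converge close to (a, b) within a few cycles, after which the closed-loop pole sits near the desired value and the output is regulated to zero, illustrating the estimate-then-design loop of Figure 3.1.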
In our research project, the identifier-based parametric adaptive control scheme is utilized,
since a deterministic estimation approach is adopted for our research. As a future research
path, it might be worthwhile to compare deterministic and stochastic estimation approaches.
3.2 Development of Thread Execution Pattern Model
As mentioned previously, particularly on multi-core chip-level multiprocessor platforms
where some level of cache is shared among cores, the execution slot allocation (scheduling) of
threads might lead to a performance bottleneck if the shared Level 2 (L2) cache resource is
excluded from the scheduling decision. In this respect, our research effort and aim are concentrated
on developing a cache aware adaptive closed loop scheduling framework which provides an
optimal and fair allocation of instruction count while taking the cache patterns of threads into
consideration. In line with our research aim, a dynamic thread cache access pattern model, which
takes the L2 cache miss count and the co-runner threads' miss counts as parameters, is developed,
and its impact on overall processing speed is investigated. The actual processing power of a
processor is measured in cycles per instruction (CPI). Ideally, for an infinite cache, it would be possible
to consider CPI as an independent, purely processor-related metric; in practice, however, it
is inevitable to have stalls and delays during the memory operations of instructions. Hence, Matick
et al. [Matick et al., 2001] define the system performance as
CPIAct = CPIIdeal + FCP, (3.17)
where CPIIdeal is the ideal CPI measured with an infinite cache system, and FCP refers to the
finite cache penalty due to memory stalls or queue delays. The finite cache penalty (FCP) is
the sum of all negative factors degrading the performance of a processor. These additional
factors diversify as the system complexity, namely the depth of the memory hierarchy, the structure of
the memory levels and the number of cores, increases. In this regard, for the sake of simplicity, it is
assumed that each core has only a single level of private Level 1 (L1) cache. Furthermore, since
our research scope is limited to chip-level multiprocessors and no static shared memory
partitioning scheme is deployed, cross-interrogation among multiple private caches is ignored
at this stage. In addition, as mentioned previously, it is assumed that the caches are by default
write-through caches, so the cast-out impact, which applies to store-in caches, is ignored. The
cache reloading time and trailing-edge effect are also neglected because their
effect is small, especially for large, high-performance systems with wide, high-speed
buses. To sum up, in our cache model only the request bus queue (RBQ) and data bus queue (DBQ)
delays, the memory access latency (delay) and the cache miss count are considered in modeling the
cache behavior of the multi-core CMP system. Based on the model of Matick et al., the cache
model can be formulated as a linear function of the DBQ and RBQ delays, the memory access delay
and the miss count as follows:
FCPL2 = F (MC,DBQ,RBQ,MAD), (3.18)
66 CHAPTER 3. SYSTEM MODEL ADAPTIVE CONTROL
where MC, MAD refer to miss count and memory access delay respectively. In this case,
CPIAct in the Equation (3.17) can be expressed in discrete time domain as follows:
CPIAct[k] = CPIIdeal(Cache→∞) + F (MC[k], DBQ[k], RBQ[k],MAD[k]), (3.19)
At this point, we can combine both DBQ and RBQ delays under the BQ variable.
CPIAct[k] = CPIIdeal(Cache→∞) + F (MC[k], BQ[k],MAD[k]). (3.20)
As discussed previously, our cache aware adaptive closed loop scheduling framework imposes
instruction fairness by allocating extra cycles to a thread that has executed fewer instruction
cycles than its dedicated-cache version would have, in order to ensure that the instruction fair
thread achieves a fair instruction count (IC) independent of the current cache allocation. In
this respect, the fairness condition can be defined as follows:
Definition 1 Fairness Condition. The fairness condition refers to the operating state in which
the thread execution pattern is guaranteed to be independent of the co-runner cache access
pattern. In other words, the thread achieves the same instruction count in the shared cache
environment as in the dedicated cache resource environment.
In this case, ∆CPUCyc[k] shall be the system input which ensures the fairness condition
in Definition 1. In this regard, the instruction count at the next CPU cycle (CPU quantum) is
formulated as follows:
IC[k + 1] = IC[k] + CPIAct[k] ∆CPUCyc[k]
          = IC[k] + (CPIIdeal + F(MC[k], BQ[k], MAD[k])) ∆CPUCyc[k]. (3.21)
Since F is a linear function of MC[k], BQ[k] and MAD[k], the Equation (3.21) can be written
as:
IC[k + 1] = IC[k] + (CPIIdeal + A[k]MC[k] + B[k]MAD[k] + C[k]BQ[k]) ∆CPUCyc[k], (3.22)
where A[k], B[k], C[k] are time varying parameter matrices referring to the cache miss penalty,
the memory access delay penalty and the buffer queue delay penalty respectively. In this regard,
3.2. DEVELOPMENT OF THREAD EXECUTION PATTERN MODEL 67
for the sake of simplicity A[k], B[k], and C[k] are assumed to be constant coefficient matrices;
in other words, cache penalties are considered constant over time. As a result, A[k], B[k]
and C[k] in the Equation (3.22) can be replaced by A, B and C:
IC[k + 1] = IC[k] + (CPIIdeal + AMC[k] +BMAD[k] + CBQ[k])∆CPUCyc[k]. (3.23)
Remark 1 In fact, the Equation (3.23) is a bilinear equation, provided that IC[k], MC[k],
MAD[k] and BQ[k] are dynamic parameter matrices defining the current state of the model
dynamics, which are updated as part of the system dynamics on each sampling period. In this
case, the system model equation can be considered in the bilinear form given in Elliott
[2009]:
X(k + 1) = AX(k) +BX(k)U(k). (3.24)
However, the bilinear system dynamic is computationally expensive and hard to solve;
hence, the following hypothesis is developed to divide the existing bilinear system dynamic
into multiple linear system dynamics.
Hypothesis 2 The bilinear system model in the Equation (3.23) can be decomposed into four
linear sub-system dynamics based on the heuristic assumption that IC[k], MC[k], MAD[k] and
BQ[k] refer to independent dynamics:
1.
IC[k + 1] = IC[k] + (CPIIdeal + AU1 +BU2 + CU3)∆CPUCyc[k], (3.25)
where U1, U2, U3 are the outputs of the linear sub-systems, namely the miss count dynamic
system, the memory access delay dynamic system and the buffer queue dynamic system
respectively; these variables U1, U2, U3 are considered as inputs to the main system given in
the Equation (3.25).
2.
MC[k + 1] = A[k]MC[k] + C[k]CMC[k], (3.26)
where A and C refer to the unknown state matrix and the unknown input matrix respectively,
and CMC[k] refers to the co-runner miss count average (1/p) Σ_{i=1}^{p} MC(Ci) as an
indication of the current cache state. In this regard, a parameter estimation algorithm is
used to identify these unknown matrices.
3.
MAD[k + 1] = MAD[k] + V [k], (3.27)
is actually a stochastic linear system representation which depends on the probabilistic
characteristic of the random noise component V[k].
4.
BQ[k + 1] = BQ[k] +W [k], (3.28)
is also a stochastic linear system representation which depends on the probabilistic
characteristic of the random noise component W[k].
Considering the last three linear sub-system dynamics separately, and estimating the sub-
system characteristics in the Equations (3.26), (3.27) and (3.28), it is possible to determine the
thread execution pattern model (3.25) in that control period. This plant or system model is then
integrated into the adaptive self-tuning control framework in the following section to obtain
time varying robust control over the whole scheduling process.
Figure 3.3 illustrates these four sub-system state-space models (3.25), (3.26), (3.27) and
(3.28), which form the open loop thread execution pattern. This model is developed
on the MATLAB Simulink platform. The model inputs, colored green in Figure 3.3, are the
additional CPU cycles allocated to the thread, ∆CPUCyc[k], and the co-runner miss count
average, CMC[k]; the system outputs, in light blue in Figure 3.3, are the cache
miss count MC[k] and the instruction count IC[k]. In our model, for the sake of simplicity, it is
assumed that V[k] and W[k] in the Memory Access Delay (MAD) model (3.27) and the Buffer Queue
Delay (BQD) model (3.28) respectively are band-limited white noise with 0.1 Watt noise power.
This thread execution pattern model is then used to construct the adaptive self-tuning control
framework.
Figure 3.3: Thread Execution Pattern Model
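The open-loop dynamics (3.25)–(3.28) can be illustrated with a small discrete-time simulation. The sketch below is a scalar stand-in for the Simulink model, not the actual implementation: the constants `cpi_ideal`, `A`, `B`, `C` and the miss-count coefficients `a`, `c` are illustrative assumptions, and the band-limited white noise of the MAD and BQD models is approximated by Gaussian increments.

```python
import random

def simulate_thread_pattern(steps, cpi_ideal=1.0, A=2.0, B=0.5, C=0.3, seed=0):
    """Scalar sketch of the open-loop model (3.25)-(3.28).

    IC: instruction count, MC: miss count, MAD: memory access delay,
    BQ: buffer queue delay.  All coefficient values are illustrative."""
    rng = random.Random(seed)
    a, c = 0.9, 0.1                     # assumed miss count dynamics A[k], C[k] of (3.26)
    ic, mc, mad, bq = 0.0, 10.0, 1.0, 1.0
    for _ in range(steps):
        d_cpu = 1000.0                  # allocated quantum, ΔCPUCyc[k]
        cmc = 8.0 + rng.uniform(-1, 1)  # co-runner miss count average CMC[k]
        # (3.25): IC driven by the outputs of the three sub-systems
        ic += (cpi_ideal + A * mc + B * mad + C * bq) * d_cpu
        mc = a * mc + c * cmc           # (3.26): miss count dynamic
        mad += rng.gauss(0.0, 0.1)      # (3.27): random walk driven by V[k]
        bq += rng.gauss(0.0, 0.1)       # (3.28): random walk driven by W[k]
    return ic, mc

ic, mc = simulate_thread_pattern(200)
```

With these assumed coefficients the miss count settles near its steady state c·CMC/(1 − a) ≈ 8, while the instruction count accumulates every quantum, mirroring the role of the green inputs and light-blue outputs of Figure 3.3.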
3.3 Control Framework Selection
Following the thread execution pattern model given in Figure 3.3, it is necessary to build an
adaptive self-tuning control system model. In this regard, our measurement metric is the
instruction count IC[k], and ICCache−Dedicated[k] is our reference IC value. The error between
the actual IC[k] and ICCache−Dedicated[k] is ∆IC[k], formulated as:
∆IC[k] = IC[k] − ICCache−Dedicated[k]. (3.29)
Then, the internal variables, the miss count MC[k] together with ∆IC[k], and the average
co-runner miss count CMC are fed into the adaptive self-tuning control framework as inputs.
The Pole Placement Algorithm module of the adaptive self-tuning control framework uses the
cache pattern polynomial coefficients and the reference instruction count to design the linear
controller at each control period. The designed controller produces the controller output
CPUCyc[k] for a given instruction count error of the thread at a particular instant. Figure 3.4
illustrates this scenario: in the figure, the Thread Execution Pattern Model box refers to the
thread execution pattern model of Figure 3.3 discussed in the previous section, and the QR
Recursive Least Square (RLS) Estimator refers to the embedded function block which uses the
input parameters MC[k] and CMC to attain a parameter estimate in the Miss Count State Space
Subsystem. Following the parameter estimation, the pole placement algorithm module uses the
estimated parameters and the reference instruction count signal to design the controller, which
determines the control input CPUCyc[k] for a given instruction count error ICError in that
cycle. A detailed discussion of the adaptive framework is conducted in the following section.
Figure 3.4: Closed-Loop CMP Control System Model
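To make the role of the pole placement module concrete, the following sketch shows the idea for a first-order scalar miss count model; the actual module operates on the estimated cache pattern polynomial coefficients, so the function name, the model order and the numeric values here are illustrative assumptions.

```python
def place_pole(a_hat, c_hat, pole=0.2):
    """For the estimated scalar model mc[k+1] = a*mc[k] + c*u[k], the state
    feedback u[k] = -K*mc[k] gives mc[k+1] = (a - c*K)*mc[k]; choosing
    K = (a - pole)/c places the closed-loop pole at the desired location."""
    return (a_hat - pole) / c_hat

# Closed-loop check: when the estimate matches the true model, the miss
# count decays at the rate of the placed pole.
a_true, c_true = 0.9, 0.1
K = place_pole(a_true, c_true, pole=0.2)
mc = 1.0
for _ in range(25):
    mc = a_true * mc + c_true * (-K * mc)   # closed loop: mc <- 0.2*mc
```

In the framework the estimates a_hat, c_hat come from the QR RLS estimator at each control period, so the controller gain is redesigned whenever the cache pattern estimate changes.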
3.4 Adaptive Control Framework Development
In our research project, our aim is to develop a general framework applicable to a wide range
of systems. Furthermore, as can be seen in the thread execution pattern model (3.23), there are
unknown system parameters already parametrized in the model itself. Using online estimation
methods, it is possible to retrieve these parameters. Hence, the Self-Tuning Indirect (Explicit)
Adaptive Control scheme, which best matches our requirements, is adopted.
In our thread execution pattern model, the main source of uncertainty is the cache
miss count dynamics of the system, since the remaining subsystems are assumed to be stochastic
systems with known distributions. In this case, deterministic online estimation algorithms are
applied only to the state-space miss count subsystem defined in the Equation (3.26).
In this regard, the state-space representation of the cache miss count dynamic can be written
as follows for a general system order n (nth order system):
MC[k + 1] = A[k]MC[k] + C[k]CMC[k],
y[k] = H MC[k]. (3.30)
Definition 3 The state vector MC[k], input vector CMC[k], and gain vector H in the Equation
(3.30) can be defined in vector form as follows:
MC[k] = [mc1[k]  mc2[k]  · · ·  mcn[k]]^T, (3.31)
CMC[k] = [cMC[k − d]  cMC[k − 1 − d]  · · ·  cMC[k − m + 1 − d]],
H = [1  0  · · ·  0],
where H, MC[k] ∈ R^n, CMC[k] ∈ R^m, and d refers to the time delay between the input and
the output. In this instance, the state variables mc1[k], · · · , mcn[k] are defined as n
time-shifted signals:
MC[k] = [mc[k]  mc[k − 1]  mc[k − 2]  · · ·  mc[k − n + 1]]. (3.32)
In this respect, using the state-space model given in the Equation (3.30) and the Definition 3, it
is possible to derive the input-output relation:
y[k] = mc[k] = A1[k] [mc[k − 1]  · · ·  mc[k − n]]^T + C1[k] [cMC[k − d]  · · ·  cMC[k − d − m + 1]]^T. (3.33)
Considering A1[k] and C1[k] as row vectors with coefficients a1, · · · , an and c0, · · · , cm−1,
the following discrete-time deterministic autoregressive moving average (DARMA) time series
model can be derived:
y[k] = mc[k] = Σ_{j=1}^{n} aj y[k − j] + Σ_{j=0}^{m−1} cj cMC[k − j − d]. (3.34)
Then, the DARMA time series model (3.34) can be expressed in static parametric model
(SPM) form:
z = y[k] = mc[k] = θ*^T ϕ[k],
where θ* and ϕ[k] are defined as
θ* = [a1, a2, · · · , an, c0, c1, · · · , cm−1]^T,
ϕ[k] = [mc[k − 1], · · · , mc[k − n], cMC[k − d], cMC[k − d − 1], · · · , cMC[k − d − m + 1]]^T. (3.35)
Here, the Miss Count State Space Model is expressed as a static parametric model (SPM)
composed of the known signal vector ϕ[k] and the unknown coefficient vector θ*. In this regard,
the QR Recursive Least Square (RLS) parameter estimation algorithm tries to estimate θ* for
the given input and output signals in ϕ[k]. The accuracy of this estimation depends not only on
the properties of the adaptive control law, but also on the properties of the plant input sequence.
It is therefore necessary to analyze the properties of the adaptive control law and the plant
input; the following lemma summarizes the properties of both that are necessary for a stable
adaptive control scheme.
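The SPM construction (3.35) can be sketched directly: build ϕ[k] from past outputs and delayed inputs, stack the regressors, and recover θ* by least squares. The orders n = m = 2, the delay d = 1 and the coefficient values below are illustrative assumptions, and batch least squares here stands in for the recursive estimator of the framework.

```python
import numpy as np

def regressor(mc, cmc, k, n=2, m=2, d=1):
    """phi[k] as in (3.35): n past outputs followed by m delayed inputs."""
    past_y = [mc[k - j] for j in range(1, n + 1)]
    past_u = [cmc[k - d - j] for j in range(m)]
    return np.array(past_y + past_u)

theta_star = np.array([0.5, -0.2, 0.3, 0.1])   # [a1, a2, c0, c1], assumed
rng = np.random.default_rng(0)
cmc = rng.standard_normal(60)                  # exciting co-runner input
mc = np.zeros(60)
for k in range(3, 60):
    mc[k] = theta_star @ regressor(mc, cmc, k) # DARMA recursion (3.34)

# Stack z[k] = theta*^T phi[k] over k and solve for theta in the LS sense
Phi = np.array([regressor(mc, cmc, k) for k in range(3, 60)])
theta_hat, *_ = np.linalg.lstsq(Phi, mc[3:60], rcond=None)
```

Because the data are noiseless and the random input is persistently exciting, the stacked system recovers θ* exactly (up to numerical precision), which is the property the recursive estimator exploits sample by sample.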
Definition 4 A wide class of adaptive control laws can be represented in the following form:
ε(k) = (z(k) − θ^T(k − 1)ϕ(k)) / m²(k), (3.36)
θ(k) = θ(k − 1) + Γ ε(k)ϕ(k), (3.37)
where θ(k) refers to the estimate of θ* at time t = kT, ϕ(k) is the regressor, z(k) is the output of
the SPM
z = θ*^T ϕ, (3.38)
m(k) ≥ 1 is the normalization signal, and Γ = Γ^T > 0 represents the adaptive gain
[Ioannou and Fidan, 2006].
For the adaptive self-tuning control framework which uses the adaptive control laws in the form
defined in Definition 4, the following lemma is useful in analyzing the stability property as well
as defining estimation accuracy of θ∗.
Lemma 1 Provided that ‖Γ‖ |ϕ(k)|²/m²(k) < 2, the adaptive control law Equations (3.36) and
(3.37) have the following properties [Ioannou and Fidan, 2006]:
i) θ(k) ∈ ℓ∞, where ℓ∞ is the space of discrete-time signals bounded in the infinity norm.
ii) ε(k), ε(k)m(k), |ε(k)ϕ(k)|, |θ(k) − θ(k − N)| ∈ ℓ2 ∩ ℓ∞, where ℓ2 and ℓ∞ refer to the
Euclidean and the infinity norm respectively. In this respect, the Euclidean norm of a
discrete-time signal x : Z+ → R is
‖x‖2 = (Σ_{i=0}^{∞} |x(i)|²)^{1/2}, (3.39)
whereas the infinity norm of x is
‖x‖∞ = sup_{i∈Z+} |x(i)|, (3.40)
where the sup (supremum), or least upper bound, of the set {|x(i)| : i ∈ Z+} refers to the
smallest value which is greater than or equal to every |x(i)|. If x(i) ∈ ℓ2 ∩ ℓ∞, then
x(i) ∈ ℓp for all p ∈ [1,∞); for such a sequence x(i), the ℓp norm is defined as:
‖x‖p = (Σ_{i=0}^{∞} |x(i)|^p)^{1/p}, ℓp = {x(i) ∈ R^n : ‖x‖p < ∞}. (3.41)
In this case, the intersection of ℓ2 and ℓ∞ indicates that the infinite discrete sequences
ε(k), ε(k)m(k), |ε(k)ϕ(k)|, |θ(k) − θ(k − N)| are bounded in all of these norms, which
guarantees the stability of these sequences on an infinite dimensional vector space.
iii) ε(k), ε(k)m(k), |ε(k)ϕ(k)|, |θ(k) − θ(k − N)| → 0 as k → ∞, where N is any finite integer
N ≥ 1. This ensures that all these discrete sequences approach zero as k goes to infinity;
in other words, the adaptive control law finally reaches a steady state.
iv) Provided that ϕ(k)/m(k) is persistently exciting, i.e. it satisfies
Σ_{i=0}^{l−1} ϕ(k + i)ϕ^T(k + i)/m²(k + i) ≥ a0 l I, (3.42)
for all k, some fixed integer l > 1 and constant a0 > 0, then the estimate θ(k) approaches
θ* (θ(k) → θ*) exponentially.
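A minimal numerical sketch of the adaptive law (3.36)–(3.37) illustrates the exponential convergence of property iv). The dimensions, gain and parameter values are illustrative assumptions; with m²(k) = 1 + |ϕ(k)|², the Lemma 1 condition ‖Γ‖|ϕ(k)|²/m²(k) < 2 holds by construction for Γ = 0.5 I.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([0.8, -0.3])       # unknown true parameters (assumed)
theta = np.zeros(2)                      # initial estimate theta(0)
Gamma = 0.5 * np.eye(2)                  # adaptive gain, Gamma = Gamma^T > 0

for k in range(2000):
    phi = rng.standard_normal(2)         # persistently exciting regressor
    z = theta_star @ phi                 # SPM output z = theta*^T phi, (3.38)
    m2 = 1.0 + phi @ phi                 # normalization signal m^2(k) >= 1
    eps = (z - theta @ phi) / m2         # normalized estimation error, (3.36)
    theta = theta + Gamma @ (eps * phi)  # gradient parameter update, (3.37)
```

Under the persistently exciting random regressor the estimate converges to θ* well within the simulated horizon, matching property iv); without excitation (e.g. a constant ϕ) only the error ε(k) is driven to zero, not the parameter error.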
3.5 Concluding Remarks
In summary, the theoretical formulations discussed in this chapter have expressed the miss
count dynamics of the thread execution pattern model in static parametric model (SPM) form
for the estimation of the unknown system parameters. Furthermore, Lemma 1 is provided to
analyze the stability and determine the estimation accuracy of a given adaptive law. All these
mathematical formulations are implemented as MATLAB function code in the adaptive
self-tuning control framework block given in Figure 3.4.
Chapter 4
Parameter Estimation in Adaptive Control
In this chapter, one of the main components of the adaptive self-tuning control framework, the
parameter estimation strategy, is investigated. In our thesis, least square linear estimation
techniques are considered the most appropriate strategy for our cache aware adaptive closed
loop scheduling framework. In particular, the adaptive weighted QR Recursive Least
Square (RLS) algorithm is researched, and on the basis of the theory, a proprietary algorithm
is designed. The chapter starts with a brief theoretical background on parameter estimation
and the least square algorithm family, and ends with the adaptive weighted QR recursive least
square algorithm.
4.1 Theoretical Background of Parameter Estimation
In the context of the adaptive self-tuning control scheme, which is called an identifier-based
control scheme in the control literature, parameter estimation is crucial in identifying system
parameters. In order to successfully apply parameter estimation to the adaptive self-tuning
control scheme, the proper construction of a model set of the system model is essential. In fact,
the model set M* is defined as M* = {M(θ) | θ ∈ DM}, and each model M(θ) in this set is
associated with a predictor ŷ(t|θ) with an associated prediction error PDF fe(x, θ). The search
for the best model within the model set is the problem of finding a parameter vector θ such that
the prediction error is minimized.
The parameter estimation is actually a mapping from the input/output data Z^N to the parameter
vector θ which minimizes the prediction error:
Z^N → θ̂N ∈ DM, where
Z^N = [y(1), u(1), y(2), u(2), · · · , y(N), u(N)], (4.1)
and y and u are the output and input samples.
The evaluation of a candidate model is based on the prediction error of the model M(θ*):
ε(t, θ*) = y(t) − ŷ(t|θ*). (4.2)
For a given data set Z^N, the prediction error ε(t, θ*) can be derived for t = 1, 2, · · · , N; and
the "best model" is the model M(θ*) which has the smallest prediction error. To quantify the
prediction error sequence given in the Equation (4.2), there are various approaches, but the most
common is to form a scalar-valued norm or criterion function which measures the size of
ε(t, θ*).
In order to find the "best" parameter vector θ̂ corresponding to the "best model" M(θ*) in the
model set, the prediction error criterion function of each model in the model set M* is derived
from the prediction error sequence given in the Equation (4.2), and the θ̂ which minimizes the
prediction error criterion function is the parameter estimate of the system based on the input
and output data Z^N. The size of the prediction error sequence, or input/output data sequence,
is design specific; the larger the error sequence, the more precise the estimation will be.
More specifically, the minimization of the prediction error sequence with respect to the
parameter vector θ can be summarized in a number of steps:
1. In the first step, the prediction-error sequence of size N is filtered through a linear
filter L(q) to remove high-frequency disturbances and slowly varying terms that are not
critical to the modeling problem [Ljung, 1998]. This step can be called the pre-filtering
stage. The pre-filtering stage is critical and effective in the frequency-domain
interpretation of the criterion function (4.4), or objective function:
εF(t, θ) = L(q)ε(t, θ), 1 ≤ t ≤ N. (4.3)
4.1. THEORETICAL BACKGROUND OF PARAMETER ESTIMATION 77
2. The second step is the construction of the norm or criterion function
VN(θ, Z^N) = (1/N) Σ_{t=1}^{N} L(εF(t, θ)), (4.4)
where L is a scalar-valued function. In fact, L is a critical factor in determining the
criterion in our model evaluation or error optimization process. In this respect, the choice
of the L function is important; the most common choice is the quadratic norm L(ε) = ½ε²,
which is convenient both in computation and in analysis, but which has a few shortcomings
in terms of robustness. It is also possible to parametrize the quadratic norm with respect to
(w.r.t.) θ as L(ε, θ). In addition, a time-varying L norm is an option if the measurement
reliability, such as noise characteristics and measurement weighting, varies with time. In
such cases, a weighting function β(N, t) can be included in the criterion function given in
the Equation (4.4):
VN(θ, Z^N) = Σ_{t=1}^{N} β(N, t) L(ε(t, θ), θ). (4.5)
The weighting β(N, t) is useful in recursive identification, where estimates θ̂ for different
N are calculated.
3. The third step is the calculation of the parameter estimate vector θ̂ which minimizes
the criterion function given in the Equation (4.4):
θ̂ = θ̂(Z^N) = arg min_{θ∈DM} VN(θ, Z^N). (4.6)
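The three steps above can be sketched end to end on a toy scalar model. This is an illustrative stand-in, not the thesis implementation: the pre-filter is a simple mean-removal approximation of L(q) (it removes a constant disturbance), the norm is the quadratic choice L(ε) = ½ε², and the minimization of step 3 is a plain grid search.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(200)
theta_true = 1.7                       # assumed scalar parameter
y = theta_true * u + 0.5               # data with a constant (slowly varying) disturbance

def criterion(theta):
    eps = y - theta * u                # prediction error eps(t, theta), (4.2)
    eps_f = eps - eps.mean()           # step 1: pre-filter removes the slow term, (4.3)
    return 0.5 * np.mean(eps_f ** 2)   # step 2: quadratic criterion V_N, (4.4)

# step 3: theta_hat = argmin of V_N over a grid of candidates, (4.6)
grid = np.linspace(0.0, 3.0, 3001)
theta_hat = grid[np.argmin([criterion(t) for t in grid])]
```

Without the pre-filtering stage the constant disturbance would bias a naive fit; after filtering, the criterion is minimized at the true parameter, which is the point of choosing L(q) to suppress terms not critical to the model.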
As regards multivariable systems, the criterion function and the quadratic criterion are
rewritten to cope with multiple prediction error sequences. In multivariable, or MIMO,
systems the quadratic criterion is modified as:
L(ε) = ½ ε^T Λ⁻¹ ε, (4.7)
where Λ is a symmetric, positive semidefinite p × p matrix that encodes the relative importance
of the components of the p × 1 prediction error vector ε. Using the quadratic criterion the
criterion
function is constructed as follows:
VN(θ, Z^N) = h(QN(θ, Z^N)), where h(Q) = ½ tr(QΛ⁻¹),
QN(θ, Z^N) = (1/N) Σ_{t=1}^{N} ε(t, θ)ε^T(t, θ). (4.8)
After the derivation of the criterion function (4.8), the calculation of the parameter estimate θ̂
is analogous to the single-input case, and is solved using the same equation (4.6).
Depending on the choice of L(·), the pre-filter L(·), the model structure and the minimization
method, parameter estimation methods vary. In fact, the choices of these elements play a
significant role in selecting a particular estimation method from the family of parameter
estimation algorithms. In our thesis, we choose the least-squares parameter estimation family
due to its compatibility with recursive identification schemes and its minimization methods
with low computational cost.
4.2 Least Square Parameter Estimation
Least square estimation provides a solution to an inexactly specified system of equations;
in other words, to overdetermined systems. An overdetermined system refers to a system of
equations with more equations than unknowns; in general, such a system has no exact solution.
As a result, for a given system of equations, least square estimation provides an approximate
solution. To be more precise, least square estimation is an optimization problem with a known
cost/error function, and the solution to this problem is a set of parameter estimates which
minimizes this error/cost function. For instance, let us assume that g(ϕ) refers to an
overdetermined system for which a number of independent inputs ϕ, the so-called regressor,
and the output y are known, but for which it is impossible to find an exact solution or pattern.
In this case, the least square estimation algorithm provides a set of parameters θ which
linearly weights the given regressor inputs, and the combination of this set of parameters with
the regression vector gives an approximation of the system output. In this circumstance, the
error is the difference between the actual and approximate system output. Indeed, least square
estimation algorithms aim to find the best parameter vector, which minimizes this error.
As a reminder, please note that a regression model refers to a set of functions which
statistically or deterministically predicts the system behavior g(ϕ) in terms of a set of
parameters θ for given regression inputs ϕ and system outputs y. In fact, this system behavior
g(ϕ) predicts the system output ŷ(t|θ).
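A small numerical sketch of the overdetermined case: 100 equations in 3 unknowns with measurement noise admit no exact solution, and the least square estimate minimizes the residual instead. The data and parameter values are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((100, 3))      # regressor matrix, one row phi^T per equation
theta_true = np.array([2.0, -1.0, 0.5])
y = Phi @ theta_true + 0.01 * rng.standard_normal(100)  # noisy outputs

# theta_hat minimizes ||y - Phi theta||^2; no theta drives the residual to zero
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residual = y - Phi @ theta_hat
```

The nonzero residual is exactly the error discussed above (actual minus approximate output), while the estimate itself stays close to the generating parameters because the noise is small.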
4.3 Adaptive Weighted QR Recursive Least Square Algorithms
In traditional recursive least square algorithms, the stability depends on the autocorrelation
matrix R = ϕ^Tϕ, where ϕ is the input data vector, and on the inverse autocorrelation matrix P.
Although R is assumed non-singular, its inverse (R⁻¹ = P) can become ill-conditioned
depending on the persistence of excitation, or character, of the input signal.
Considering these possibilities, it is necessary to develop a better strategy to solve the
Recursive Least Square (RLS) problem. Hence, the QR decomposition approach, which is
numerically well-conditioned, is deployed. It actually prevents inaccurate solutions to the RLS
problem, and allows easy monitoring of the positive definiteness of a transformed input matrix.
4.3.1 Theoretical Background
To begin with a quick review, the Recursive Least Square (RLS) algorithm searches recursively
for the coefficients of the adaptive filter which minimize the cost function. The RLS cost
function is defined as follows:
ξd(k) = Σ_{i=0}^{k} λ^{k−i} ε²(i), (4.9)
ε(k) = mc(k) − ϕ^T(k)w(k), (4.10)
ϕ(k) = [mc(k)  mc(k − 1)  . . .  mc(k − N)  cmc(k)  cmc(k − 1)  . . .  cmc(k − N)]^T, (4.11)
w(k) = [w0(k)  w1(k)  · · ·  wN(k)],
where y, or mc, refers to the output of the system, the miss count, and cmc, the co-runner miss
count, together with the miss count mc forms the input of the system. Moreover, λ, ϕ(i), w and
ε refer to the adaptive weight (forgetting) coefficient, the input regression vector, the
parameter weight vector and the error vector respectively. Please note that although the θ
notation has previously been used for the parameter coefficient vector, from now on the w
notation will be used to prevent confusion with the Givens rotation matrices. To begin with,
for an N-dimensional system, the following input data matrix is constructed from the actual
output miss count (mc) and the co-runner miss count (cmc):
Ψ^T(k) = Ψ
= [ mc(k−1)      λ^{1/2}mc(k−2)      · · ·   λ^{(k−1)/2}mc(k−N)      λ^{k/2}mc(k−N−1)
    mc(k−2)      λ^{1/2}mc(k−3)      · · ·   λ^{(k−2)/2}mc(k−N−1)    0
    ⋮            ⋮                    ⋱       ⋮                       ⋮
    mc(k−N−1)    λ^{1/2}mc(k−N−2)    · · ·   0                       0
    cmc(k−1)     λ^{1/2}cmc(k−2)     · · ·   λ^{(k−1)/2}cmc(k−N)     λ^{k/2}cmc(k−N−1)
    cmc(k−2)     λ^{1/2}cmc(k−3)     · · ·   λ^{(k−2)/2}cmc(k−N−1)   0
    ⋮            ⋮                    ⋱       ⋮                       ⋮
    cmc(k−N−1)   λ^{1/2}cmc(k−N−2)   · · ·   0                       0 ]
= [ ϕ(k)   λ^{1/2}ϕ(k−1)   · · ·   λ^{k/2}ϕ(0) ]. (4.12)
Following the input matrix construction, the next step is the calculation of the posterior error
vector ε, which is the difference between the estimate m̂c(k) and the actual miss count vector
mc(k):
m̂c(k) = Ψ(k)w(k) = [m̂c(k)  λ^{1/2}m̂c(k − 1)  · · ·  λ^{k/2}m̂c(0)]^T, (4.13)
mc(k) = [mc(k)  λ^{1/2}mc(k − 1)  · · ·  λ^{k/2}mc(0)]^T, (4.14)
ε = [ε(k)  λ^{1/2}ε(k − 1)  · · ·  λ^{k/2}ε(0)]^T = mc(k) − m̂c(k). (4.15)
Following the posterior error vector derivation, the cost function (4.16) is derived from the
posterior error vector given in the Equation (4.15):
ξd(k) = ε^T ε. (4.16)
For such system and posterior error equations, the optimal RLS solution can be formalized as
given in the Equation (4.17):
Ψ^T(k)Ψ(k)w(k) = Ψ^T(k)mc(k), (4.17)
where the conventional RLS solution can become inaccurate because RD(k) = Ψ^T(k)Ψ(k) and
its inverse may be ill-conditioned when the persistence of excitation of the input is lost.
To avoid this, the QR decomposition approach is used.
4.3.2 Formulation and Theoretical Conclusions
The QR decomposition approach in a recursive least square (RLS) problem can in fact be
better explained in three stages: initialization, input data matrix triangularization, and QR-
Decomposition RLS Algorithm.
1st Stage: Initialization
During the initialization stage, for given input ϕ(k) and output mc(k) data from k = 0 to
k = N, it is possible to derive the initial parameter vector w from the RLS solution given in
the Equation (4.17). This is done using the so-called back-substitution algorithm:
wi(k) = (−Σ_{j=1}^{i} ϕ(j)w_{i−j}(k) + mc(i)) / ϕ(0). (4.18)
These parameter coefficients are considered as the initial condition for the RLS algorithm. The
reason the back-substitution method is preferred is that it does not use the matrix inversion
method while calculating parameter vector elements; in other words, it is a well-conditioned
algorithm. However, it is necessary for this method to have a data matrix in triangular form.
As a result, triangularization of input data matrix is conducted as a preliminary step for the
back-substitution algorithm.
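The operation behind (4.18) and, later, (4.26) is ordinary back-substitution on an upper-triangular system, which needs no matrix inversion. A generic sketch (the example matrix is illustrative):

```python
import numpy as np

def back_substitute(U, b):
    """Solve U w = b for upper-triangular U, working upward from the last
    equation; no matrix inversion is performed, only divisions by the
    diagonal entries."""
    n = len(b)
    w = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # subtract the already-solved unknowns, then divide by the diagonal
        w[i] = (b[i] - U[i, i + 1:] @ w[i + 1:]) / U[i, i]
    return w

U = np.array([[2.0, 1.0, 0.5],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 4.0]])
w_true = np.array([1.0, -2.0, 0.5])
w = back_substitute(U, U @ w_true)
```

The only requirement is the triangular form of the data matrix, which is exactly why the triangularization of the next stage is a prerequisite.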
2nd Stage: Input Data Matrix Triangularization
As mentioned previously, input data matrix triangularization is a preliminary step for
back-substitution, and this triangularization step has to be repeated recursively on each
control cycle. This iterative triangularization is inevitable as long as new input/output
information enters the system. As a result, at k = N + 1 the input data matrix has the
following form:
Ψ(N + 1) = [ mc(N + 1)  mc(N)  · · ·  mc(1)  cmc(N + 1)  cmc(N)  · · ·  cmc(1) ; λ^{1/2}Ψ(N) ]
         = [ ϕ^T(N + 1) ; λ^{1/2}Ψ(N) ]. (4.19)
As can be seen, the matrix Ψ(N + 1) in (4.19) is no longer in triangular form even though
Ψ(N) is. Hence, it is necessary to triangularize the matrix Ψ(N + 1) using an orthogonal
triangularization approach such as the Givens rotation. Alternative orthogonal triangularization
approaches are the Householder transformation and Gram-Schmidt orthogonalization. However, due
to the recursive nature of RLS algorithm, Givens rotation is more convenient in our research.
In the Givens rotation approach, the aim is to eliminate each element in the first row of the
input data matrix (4.19). This is achieved by multiplying Ψ(N + 1) with a series of Givens
rotation matrices. A Givens rotation matrix in this series has the form given in the Equation
(4.20):
Qθi(k) =
[ cos(θi(k))   0 · · · 0   −sin(θi(k))   0 · · · 0
  0                        0
  ⋮            Ii          ⋮             0
  0                        0
  sin(θi(k))   0 · · · 0    cos(θi(k))   0 · · · 0
  0                        0
  ⋮            0           ⋮             IN−i
  0                        0                       ], (4.20)
i.e., a plane rotation acting on the first row and the (i + 2)nd row, with the identity blocks
Ii and IN−i on the remaining diagonal.
In the Equation (4.20), the Givens rotation angle θi(k) is calculated such that the overall
orthogonalization goal, eliminating the first-row elements of the input data matrix (4.19), is
achieved. Each matrix Qθi(k) in the series of Givens rotation matrices Q(k) eliminates a single
element in the first row. Accordingly, θi is calculated using the following expressions:
cos(θi(k)) = Ui(k)_{i+1, 2N+1−i} / ci,
sin(θi(k)) = ϕi(k − 2N − i) / ci,
where
ci = √( Ui(k)²_{i+1, 2N+1−i} + ϕi(k − 2N − i)² ). (4.21)
Then, this set of Givens rotation matrices is multiplied with the input data matrix. As may
already be observed, the number of rotation matrices equals the number of elements in the
first row of the input data matrix. For an input data matrix Ψ(k) (4.12) which has 2N elements
in its first row, the set of Givens rotation matrices includes 2N elements, as shown in the
Equation (4.22):
Qθ(k) = Qθ_{2N}(k) Qθ_{2N−1}(k) · · · Qθ_{2N−i}(k) · · · Qθ_0(k). (4.22)
Recursively, at each time instant k, the rotation angles θ_{2N}, · · · , θ_0 are calculated and
the corresponding Qθi matrices are derived using the scheme defined in the Equation (4.20).
The final step is the multiplication of each of these Givens rotation matrices with the input
data matrix, and the derivation of the triangularized input matrix.
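The triangularization step of this stage can be sketched numerically. The sketch below simplifies the bookkeeping of (4.20)–(4.22): the new regressor row ϕᵀ is stacked on top of the weighted triangular factor λ^{1/2}U, and each first-row element is zeroed against the corresponding diagonal entry by a plane rotation; indices are 0-based and the matrix sizes are illustrative.

```python
import numpy as np

def rotate_in_row(U, phi, lam=0.99):
    """One stage-2 update: stack phi^T above sqrt(lam)*U, then zero the top
    row column by column with Givens rotations, restoring triangular form."""
    n = U.shape[0]
    M = np.vstack([phi, np.sqrt(lam) * U])   # (n + 1) x n
    for j in range(n):
        a, b = M[j + 1, j], M[0, j]          # diagonal entry vs. element to zero
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r                  # cos and sin of the rotation angle
        top, diag = M[0].copy(), M[j + 1].copy()
        M[j + 1] = c * diag + s * top        # rotated row keeps the diagonal
        M[0] = -s * diag + c * top           # top-row element becomes zero
    return M[1:]                             # updated triangular factor

rng = np.random.default_rng(4)
U0 = np.triu(rng.standard_normal((3, 3))) + 3.0 * np.eye(3)
phi = rng.standard_normal(3)
U1 = rotate_in_row(U0, phi, lam=0.9)
```

Because the rotations are orthogonal, the updated factor satisfies U1ᵀU1 = λ U0ᵀU0 + ϕϕᵀ, i.e., the exponentially weighted autocorrelation is updated without ever forming it explicitly.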
3rd Stage: QR-Decomposition RLS Algorithm
In this section, the triangularization procedure discussed so far is applied to the Recursive
Least Square (RLS) problem; the resulting algorithm is expected to be more stable and
well-conditioned since it avoids the calculation of the inverse autocorrelation matrix of
conventional RLS algorithms. The conventional recursive least square (RLS) algorithm considers
the posterior error as the difference between the actual measurement and the estimate; in our
research, these terms refer to the actual miss count and the estimated miss count respectively:
ε(k) = [ε(k)  λ^{1/2}ε(k − 1)  · · ·  λ^{k/2}ε(0)]^T = [mc(k)  λ^{1/2}mc(k − 1)  · · ·  λ^{k/2}mc(0)]^T − Ψ(k)w(k). (4.23)
In the Equation (4.23), the posterior weighted error vector is written as a function of the input
data matrix, the actual output data and the parameter coefficient vector. Following the QR
approach, both sides of the posterior error vector equation (4.23) are premultiplied by the
Givens rotation matrix (4.22):
Q(k)ε(k) = Q(k)mc(k) − Q(k)Ψ(k)w(k) = mcq(k) − [0 ; U(k)] w(k),
where
εq(k) = [εq_1(k)  εq_2(k)  · · ·  εq_{k+1}(k)]^T,
and
mcq(k) = [mcq_1(k)  mcq_2(k)  · · ·  mcq_{k+1}(k)]^T. (4.24)
Because the Givens rotation matrix Q(k) is orthogonal, the cost function, or weighted square
error ξd(k), of the QR-decomposed RLS problem and of the conventional RLS problem is
identical, as proved in the following equation:
ξd(k) = εq^T(k)εq(k) = ε^T(k)Q^T(k)Q(k)ε(k) = ε^T(k)Iε(k). (4.25)
The weighted square error in the Equation (4.25) is minimized using a back-substitution
algorithm. This is achieved by deriving w such that εq_{k−2N+1}(k) through εq_{k+1}(k) are
set to zero in the Equation (4.26) [Diniz, 2008]:
wi(k) = (−Σ_{j=1}^{i} u_{2N+1−i, i−j+1}(k)w_{i−j}(k) + mcq_{k+1−i}(k)) / u_{2N+1−i, i+1}(k), (4.26)
for i = 0, 1, · · · , 2N, where Σ_{j=1}^{0} ≡ 0.
More precisely, the rotated miss count vector can be written as a function of the weighted
posterior error vector, as given in the Equation (4.27):
mcq(k) = [mcq1(k) ; mcq2(k)],
with mcq1(k) = [mcq_1(k)  · · ·  mcq_{k−2N}(k)]^T and mcq2(k) = [mcq_{k−2N+1}(k)  · · ·  mcq_{k+1}(k)]^T, so that
mcq(k) = [εq_1(k)  · · ·  εq_{k−2N}(k)  0  · · ·  0]^T + [0 ; U(k)] w(k), (4.27)
where w(k) is the optimum coefficient vector at instant k. From the Equation (4.27), it can
easily be observed that mcq1(k) in fact does not depend on the parameter coefficient vector w(k):
mcq1(k) = [mcq_1(k)  · · ·  mcq_{k−2N}(k)]^T = [εq_1(k)  · · ·  εq_{k−2N}(k)]^T, (4.28)
whereas mcq2(k) is the matrix product of U(k) with the parameter vector w(k), as shown in the
Equation (4.29):
mcq2(k) = [mcq_{k−2N+1}(k)  · · ·  mcq_{k+1}(k)]^T = U(k)w(k). (4.29)
Indeed, the Equation (4.29) is the back-substitution equation (4.18) in matrix notation. In
addition, because Q is the product of a series of orthogonal Givens rotation matrices
Q_{2N} · · · Q_0, it is possible to conclude that Q can be rewritten in recursive form. According
to the Equation (4.23), the input data matrix can be triangularized using the Givens rotation
matrix Q:
Q(k)Ψ(k) = [0 ; U(k)]. (4.30)
Now, using the fact that Givens matrices are orthogonal (Q^TQ = I), the Equation (4.30) can be
written in the following form:
Q(k) [1  0^T ; 0  Q^T(k − 1)] [1  0^T ; 0  Q(k − 1)] [ϕ^T(k) ; λ^{1/2}Ψ(k − 1)] = [0 ; U(k)]. (4.31)
Defining
\[
\tilde{Q}(k) = Q(k)\begin{bmatrix} 1 & 0^T \\ 0 & Q^T(k-1) \end{bmatrix}, \tag{4.32}
\]
Equation (4.31) can be rewritten as:
\[
\tilde{Q}(k)\begin{bmatrix} \varphi^T(k) \\ \lambda^{1/2}\psi(k-1) \end{bmatrix}
= \begin{bmatrix} 0_{(k-N)\times(N+1)} \\ U(k) \end{bmatrix}. \tag{4.33}
\]
This is in fact equal to
\[
\tilde{Q}(k)\begin{bmatrix} \varphi^T(k) \\ 0_{(k-N-1)\times(N+1)} \\ \lambda^{1/2}\,U(k-1) \end{bmatrix}
= \begin{bmatrix} 0_{(k-N)\times(N+1)} \\ U(k) \end{bmatrix}. \tag{4.34}
\]
Using Equations (4.32) and (4.34), it is possible to deduce the structure of \(\tilde{Q}\). In fact, its structure includes an identity matrix \(I_{k-N-1}\), such that it can be represented in the following form [Apolinario and Miranda, 2009]:
\[
\tilde{Q}(k) =
\begin{bmatrix}
* & 0 & \cdots & 0 & * & \cdots & * \\
0 &   &        &   & 0 &        & 0 \\
\vdots & & I_{k-N-1} & & \vdots & & \vdots \\
0 &   &        &   & 0 &        & 0 \\
* & 0 & \cdots & 0 & * & \cdots & * \\
\vdots & & & & \vdots & \ddots & \vdots \\
* & 0 & \cdots & 0 & * & \cdots & *
\end{bmatrix}. \tag{4.35}
\]
According to the given structure of \(\tilde{Q}\), it is feasible to reduce the ever-increasing order of \(\tilde{Q}\), which is (k + 1) × (k + 1). This is achieved by removing the \(I_{k-N-1}\) section along with the corresponding rows and columns. Hence, Givens rotation matrices of size (N + 2) × (N + 2), as given in (4.22), are constructed. Using these matrices, Equation (4.36), which defines the miss count vector in terms of the actual miss count, the Givens rotation matrices and the triangularized miss count vector, is obtained. In this way, the other intermediate quantities are eliminated.
\[
mc(k) = \begin{bmatrix} \varepsilon_{q1}(k) \\ mc_{q2}(k) \end{bmatrix}
= Q_\theta(k) \begin{bmatrix} mc(k) \\ \lambda^{1/2}\, mc_{q2}(k-1) \end{bmatrix}. \tag{4.36}
\]
Equation (4.36) can also be rewritten in a form where the Givens rotation matrix is decomposed into a series of Givens rotation matrices as follows:
\[
mc(k) = Q_{\theta\,2N}(k)\, Q_{\theta\,2N-1}(k) \cdots Q_{\theta\,i}(k)
\begin{bmatrix} mc_i(k) \\ mc_{q2\,i}(k-1) \end{bmatrix}, \tag{4.37}
\]
where Qθi(k) can be derived using the Equation (4.22). In addition, it is worthwhile to state
that a posterior output error can be computed without explicit derivation of w [Diniz, 2008].
This provides significant computational cost savings.
The formulations and derivations of the QR-RLS algorithm given throughout this section can be summarized with a few remarks:
1. For a given input and output data, the initial parameter vector is constructed using the
back-substitution method given in the Equation (4.18).
2. The QR triangularization algorithm is applied to the posterior error vector equation given
in the Equation (4.23); a set of Givens rotation matrices is generated using the definition
(4.22). Each of these rotation matrices is applied iteratively to both sides of the equation
in order to obtain the triangularized equation given in the Equation (4.24).
3. From these well-conditioned triangularized vectors and matrices, the relation between
output and input data matrices is formulated in terms of the unknown parameter coeffi-
cient vector w(k) given in the Equation (4.29).
4. Based on this relation defined by the Equation (4.29), a back-substitution algorithm is
constructed to calculate unknown parameter vector elements.
Algorithm 1 Adaptive QR Recursive Least Square Estimation Algorithm

1. Initialize the parameter vector w, the input data matrix X(N), the upper triangular matrix \(U_0(N+1)\) and the system output vector \(mc_{q20}(N+1)\):
\[
x_{2N\times 1}(0) = \begin{bmatrix} mc(k) \\ mc(k-1) \\ \vdots \\ mc(k-N) \\ cmc(k) \\ cmc(k-1) \\ \vdots \\ cmc(k-N) \end{bmatrix} \tag{4.38}
\]
Do for i = 1 to 2N:
&nbsp;&nbsp;Do for j = 1 to N:
&nbsp;&nbsp;&nbsp;&nbsp;if \(N + 2 - i - j > 0\):
\[
X(i, j) = -\lambda^{(i-1)/2}\, mc(N + 2 - j - i) \tag{4.39}
\]
&nbsp;&nbsp;End
&nbsp;&nbsp;Do for j = N + 1 to 2N:
&nbsp;&nbsp;&nbsp;&nbsp;if \(2N + 2 - j - i > 0\):
\[
X(i, j) = \lambda^{(i-1)/2}\, cmc(2N + 2 - j - i) \tag{4.40}
\]
&nbsp;&nbsp;End
End
For k = 1 to N:
&nbsp;&nbsp;Do for i = 1 to k:
\[
w_i(k) = \frac{\sum_{j=1}^{i} x(j)\, w_{i-j}(k) + mc(i)}{x(0)} \tag{4.41}
\]
&nbsp;&nbsp;End
End
\[
U'_0(2N + 1) = \lambda^{1/2}\, X(2N + 1) \tag{4.42}
\]
\[
mc'_{q20}(2N + 1) = \big[\, \lambda^{1/2} mc(2N) \;\; \lambda\, mc(2N-1) \;\; \cdots \;\; \lambda^{(N+1)/2} mc(0) \,\big] \tag{4.43}
\]
2. For each time cycle k, \(2N + 1 \le k\), compute:
\[
\gamma^{-1} = 1, \tag{4.44}
\]
\[
mc'_0(k) = mc(k), \tag{4.45}
\]
\[
x'_0 = x^T(k). \tag{4.46}
\]
(a) Triangularization of the input matrix X and the output vector mc
&nbsp;&nbsp;i. Calculation of the Givens rotation matrix elements \(\cos\theta_i\) and \(\sin\theta_i\):
Do for i = 0 to 2N: (4.47)
\[
c_i = \sqrt{[U'_i(k)]^2_{i+1,\,2N+1-i} + x'^2_i(2N + 1 - i)}, \tag{4.48}
\]
\[
\cos(\theta_i) = \frac{[U'_i(k)]_{i+1,\,2N+1-i}}{c_i}, \tag{4.49}
\]
\[
\sin(\theta_i) = \frac{x'_i(2N + 1 - i)}{c_i}, \tag{4.50}
\]
\[
\begin{bmatrix} x'_{i+1}(k) \\ U'_{i+1}(k) \end{bmatrix}
= Q'_{\theta i}(k) \begin{bmatrix} x'_i \\ U'_i(k) \end{bmatrix}, \tag{4.51}
\]
\[
\gamma'_i = \gamma'_{i-1}\cos(\theta_i), \tag{4.52}
\]
\[
\begin{bmatrix} mc'_{i+1}(k) \\ mc'_{q2(i+1)}(k) \end{bmatrix}
= Q'_{\theta i}(k) \begin{bmatrix} mc'_i \\ mc'_{q2\,i}(k) \end{bmatrix}. \tag{4.53}
\]
End (4.54)
\[
mc'_{q20}(k + 1) = \lambda^{1/2}\, mc'_{q2\,2N+1}(k + 1), \tag{4.55}
\]
\[
U'_0(k + 1) = \lambda^{1/2}\, U'_{2N+1}(k), \tag{4.56}
\]
\[
\gamma(k) = \gamma'_{2N}, \tag{4.57}
\]
\[
\varepsilon(k) = d'_{2N+1}(k)\,\gamma(k). \tag{4.58}
\]
&nbsp;&nbsp;ii. The back-substitution algorithm is applied to find the parameter coefficients:
\[
mc = \big[\, mc'_{2N+1} \;\; mc'_{q2\,2N+1}(k) \,\big], \tag{4.59}
\]
\[
w_0(k) = \frac{mc_{2N+1}(k)}{U_{2N,1}(k)}, \tag{4.60}
\]
Do for i = 1 to 2N: (4.61)
\[
w_i(k) = \frac{-\sum_{j=1}^{i} U_{2N+1-i,\,i-j+1}\, w_{i-j}(k) + mc_{2N+2-i}(k)}{U_{2N+1-i,\,i+1}(k)}. \tag{4.62}
\]
End (4.63)
End (4.64)
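The rotation-and-back-substitution cycle of Algorithm 1 can be condensed into a short program. The following Python sketch is our own simplified rendering (flat lists, generic variable names, a single regressor row per step), not the thesis notation: a λ-weighted triangular factor is updated with explicit Givens rotations, and the coefficients are recovered by back-substitution.

```python
import math

def givens(a, b):
    # Rotation (c, s) that zeroes b against a: c*a + s*b = r, -s*a + c*b = 0.
    r = math.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def qr_rls_update(U, z, x, d, lam=0.98):
    """One QR-RLS time step: weight the triangular factor U and the rotated
    target z by sqrt(lambda), then annihilate the new regressor row x
    (with desired output d) into them via Givens rotations."""
    n = len(x)
    sq = math.sqrt(lam)
    U = [[sq * u for u in row] for row in U]
    z = [sq * zi for zi in z]
    x = list(x)
    for i in range(n):
        c, s = givens(U[i][i], x[i])
        for j in range(i, n):
            U[i][j], x[j] = c * U[i][j] + s * x[j], -s * U[i][j] + c * x[j]
        z[i], d = c * z[i] + s * d, -s * z[i] + c * d
    return U, z

def solve_coefficients(U, z):
    # Back-substitution on the triangular factor (cf. Equation (4.62)).
    n = len(z)
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = sum(U[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (z[i] - acc) / U[i][i]
    return w
```

With λ = 1 and two regressor samples of the model d = 2x₁ − x₂, two updates already recover w = (2, −1).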
4.3.3 Complexity Analysis of QR-RLS Algorithm
The QR-RLS algorithm improves the complexity of the conventional recursive least squares (RLS) algorithm from Θ(n³) to Θ(n²) [Apolinario and Miranda, 2009]. This complexity can be further improved to Θ(n) using Fast QR-RLS algorithms; however, due to the scope and time frame of our research, further theoretical analysis and the Fast QR-RLS algorithm are recommended as future work.
4.4 Concluding Remarks
In this chapter, a suitable parameter estimation algorithm is integrated into our adaptive self-tuning control framework. The Recursive Least Square (RLS) estimation algorithm family is chosen for its simplicity and its low computational cost as a recursive process repeated every control period. Among this family, the adaptive weighted QR Recursive Least Square (QR-RLS) algorithm is selected, because the conventional RLS algorithm suffers from an ill-conditioned estimation process. On the basis of these theoretical facts, an algorithm is developed for further simulation and experiment purposes. The performance of our algorithm is validated in the Experimental Setup and Simulation chapter.
Chapter 5
Algebraic Controller Design Methods
In this chapter, the controller design module of our adaptive self-tuning control framework is developed. The controller design problem, which is in fact a recursive design problem under given time-varying plant constraints, is addressed using algebraic controller design methods. The aim of this chapter is to formulate the controller design problem in terms of varying system constraints.
5.1 Introduction and Background
Algebraic controller design methods address a particular category of modern control design problems in which the controller has a pre-specified structure. In this case, it is only required to determine the controller parameters for which certain closed-loop performance requirements are met. These algebraic design methods solve many control design problems, such as input-output decoupling and exact model matching. In the context of adaptive control, algebraic methods addressing pole placement problems are called adaptive pole placement methods [Astrom and Wittenmark, 1994]. Ioannou and Fidan categorize adaptive pole placement methods into two groups:
Direct Adaptive Pole Placement In the first approach, so-called direct adaptive pole placement (APPC), the system dynamics can be parametrized in terms of the controller parameters. In this case, the parameter estimation algorithms explicitly produce controller parameters rather than system parameters. That is, the parameters of the control law are generated directly by solving algebraic equations, without any intermediate step involving estimates of the system parameters. However, direct APPC approaches are only applicable to a special class of systems whose dynamics can be expressed in terms of the controller parameters. This type of approach is deployed in the Direct (Implicit) Adaptive Control Scheme 3.2, the Model Reference Adaptive Control Scheme, and Dual Adaptive Control Systems.
Indirect Adaptive Pole Placement In the second approach, indirect adaptive pole placement (APPC), the recursive parameter estimation algorithms identify system or plant parameters rather than directly estimating controller parameters. These plant parameters are fed into control design algorithms, such as pole placement, to find the corresponding controller parameters. This approach is common in the Indirect Adaptive Control Scheme 3.1. Indirect APPC schemes are easy to design and applicable to a wide class of systems; however, uncertainty in the parameter estimation might lead to instability in the controller. This issue can be alleviated by using advanced control methodologies such as linear quadratic controller design or robust control system design.
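As a toy illustration of the indirect route, the sketch below (a hypothetical first-order plant and helper names of our own, not the thesis's cache model) first identifies the plant parameters from measured input-output data and only then derives the control law, here a one-step deadbeat design, from the estimates:

```python
def estimate_plant(y0, u0, y1, u1, y2):
    """Identify (a, b) of the plant y(k+1) = a*y(k) + b*u(k) from two
    consecutive transitions -- the 'indirect' step: plant parameters
    first, controller parameters afterwards."""
    det = y0 * u1 - y1 * u0
    a = (y1 * u1 - y2 * u0) / det
    b = (y0 * y2 - y1 * y1) / det
    return a, b

def deadbeat_control(a, b, y, r):
    # Design step: choose u so that a*y + b*u equals the reference r
    # in a single step (all closed-loop poles at the origin).
    return (r - a * y) / b

# Simulated plant with true parameters a = 0.5, b = 2.0 (illustrative values).
y0, u0 = 1.0, 1.0
y1 = 0.5 * y0 + 2.0 * u0
u1 = 0.0
y2 = 0.5 * y1 + 2.0 * u1

a_hat, b_hat = estimate_plant(y0, u0, y1, u1, y2)
u = deadbeat_control(a_hat, b_hat, y2, r=3.0)
y3 = 0.5 * y2 + 2.0 * u  # the plant output tracks r in one step
```

The separation between `estimate_plant` and `deadbeat_control` is exactly the structure the indirect APPC scheme prescribes: any design algorithm could be substituted for the deadbeat step without touching the estimator.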
In our adaptive self-tuning control framework, the indirect adaptive pole placement approach is used with a simple control design methodology, which creates a linear controller at each control period for a given set of plant parameters and a reference signal or command. Figure 5.1 illustrates a very basic adaptive self-tuning control framework with a general linear controller with two degrees of freedom. In this case, for a given single-input single-output system, the plant can be defined by the following equation:
\[
A(q)\,y(t) = B(q)\,\big(u(t) + v(t)\big), \tag{5.1}
\]
where y(t) is the output, u(t) is the input and v(t) is a disturbance. A and B are polynomials in the forward shift operator q, with degrees \(\deg A = n\) and \(\deg B = \deg A - d_0\), where \(d_0\) is the pole excess. In addition, a general controller can be described by
\[
R\,u(t) = T\,u_c(t) - S\,y(t), \tag{5.2}
\]
where R, S and T are polynomials. The control law (5.2), \(u(t) = \frac{T}{R}u_c(t) - \frac{S}{R}y(t)\), comprises negative feedback of y(t) with transfer function S/R and a feedforward of \(u_c(t)\) with transfer function T/R. Since the controller has two independent components, feedback and feedforward, which together determine the control input u(t), the controller has two degrees of freedom. Figure 5.1 shows the closed-loop system corresponding to this plant and controller. The closed-loop characteristic
Figure 5.1: A General Linear Controller with Two Degrees of Freedom [Astrom andWittenmark, 1994]
polynomial can be derived for the closed-loop system shown in Figure 5.1 after a few steps of derivation, eliminating u(t):
\[
y(t) = \frac{BT}{AR + BS}\,u_c(t) + \frac{BR}{AR + BS}\,v(t), \tag{5.3}
\]
\[
u(t) = \frac{AT}{AR + BS}\,u_c(t) - \frac{BS}{AR + BS}\,v(t), \tag{5.4}
\]
\[
AR + BS = A_c. \tag{5.5}
\]
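The characteristic polynomial in Equation (5.5) is ordinary polynomial arithmetic, so it can be formed numerically by convolution and addition; a minimal sketch (coefficient lists in ascending powers, helper names ours):

```python
def polymul(p, q):
    # Convolution: coefficients of the product of two polynomials.
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def polyadd(p, q):
    # Coefficient-wise sum with zero padding to the longer polynomial.
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return [pi + qi for pi, qi in zip(p, q)]

def closed_loop_characteristic(A, B, R, S):
    # Ac = A*R + B*S, Equation (5.5).
    return polyadd(polymul(A, R), polymul(B, S))
```

For example, A = [1, −0.5], R = [1], B = [1], S = [0.3] gives Ac = [1.3, −0.5].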
In this regard, the key objective is to specify the desired closed-loop characteristic polynomial Ac, and to solve for R and S given the system polynomials A and B. At this stage, the algebraic controller design algorithms restate this system-level controller problem in terms of polynomial coefficients. Then, to solve for the controller coefficients in terms of the system polynomial coefficients, these design algorithms use well-established mathematical tools such as Diophantine equations and the uncertain coefficient method. As a result, a set of controller parameters is calculated for a given time-varying system and a pre-determined reference closed loop. In fact, the algebraic design approach allows this strategy to be applied in a recursive manner.
Depending on the desired closed-loop system response characteristic, algebraic controller design algorithms can vary. In this thesis, two algebraic controller design algorithms are investigated: the deadbeat controller design algorithm and the adaptive pole placement algorithm. While the deadbeat controller design algorithm imposes a rigid and often impractical closed-loop system response, the pole placement controller design algorithm gives designers the flexibility to determine the closed-loop system response. In the following section, these two algebraic control design algorithms are applied to our adaptive self-tuning control framework.
5.2 Theoretical Formulation
As discussed previously, our adaptive self-tuning control framework involves the estimation of plant parameters and the design of a controller. Together, these two tasks guarantee the tracking of reference signals in highly unpredictable environments. In this chapter, the design of the controller is covered using two algebraic controller design methods: the deadbeat controller and the pole placement algorithm.
5.2.1 Deadbeat Controller Design
The deadbeat controller design algorithm is an endeavor to find a controller that achieves zero tracking error in a finite number of control steps. According to control theory, this is achieved by selecting the controller parameters so that the closed-loop system poles are located at the origin of the z-plane. A pole at the origin corresponds to a pulse in the time domain; in other words, minimum settling time and infinite controller gain. However, this design method is only readily applicable to linear closed-loop systems, and it remains an open research question for non-linear systems.
There are two versions of the deadbeat design method: a strong version and a weak version. In the strong version, the closed loop reaches the steady state after a finite number of control steps; in the weak version, the closed-loop system only converges to the steady state within a finite number of control steps. In our project, the strong version of the deadbeat controller design algorithm is covered. In this section, using the strong version of the deadbeat algorithm, an algebraic formulation of the controller design problem is derived for given plant parameters and a reference signal.
For the dynamic thread cache access pattern model given in Equation (3.30) and the parameter coefficient vectors derived in Equation (4.26) or (4.17), it is possible to find a controller such that the closed-loop system response achieves zero tracking error in a finite number of steps.
According to the dynamic cache access pattern model in Equation (3.30) and the adaptive closed-loop system model given in Equation (3.23), it is possible to derive a closed-loop transfer function and to formulate the closed-loop system response in terms of the unknown controller and known plant characteristic polynomials. More specifically, the basic steps of the deadbeat controller design algorithm are summarized in Algorithm 2.
So far, a brief overview of the procedure has been given through Algorithm 2; however, some points of the algorithm deserve more detailed discussion.
1. First of all, it is necessary to further discuss the polynomial orders given in Equation (5.14). To begin with, it is worthwhile to mention that in the course of developing the adaptive self-tuning control framework, a set of design decisions has been made on the structure and complexity of the plant and reference signals; for instance, the degrees of the plant transfer function polynomials \(A(z^{-1})\) and \(B(z^{-1})\), and the structure and order of the reference signal transfer function \(IC_{Cache-Dedicated} = \frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}\). Namely, the polynomials \(A(z^{-1})\) and \(B(z^{-1})\) are constructed according to the parameter estimation equation (4.62), which is in fact a representation of the closed-loop system plant. In our case, we consider the plant dynamics as a second order transfer function composed of the polynomials \(A(z^{-1})\) and \(B(z^{-1})\). For estimated plant coefficients \(a_i\) and \(b_i\), where i = 0, ..., 2, these polynomials can be written as:
\[
A(z^{-1}) = a_0 + a_1 z^{-1} + a_2 z^{-2},
\qquad
B(z^{-1}) = b_0 + b_1 z^{-1} + b_2 z^{-2}. \tag{5.15}
\]
At this stage, other important design decisions can be considered, such as the construction of the reference signal structure. According to the polynomial degree criterion given in Equation (5.14), a second order reference signal \(IC_{Cache-Dedicated} = \frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}\) is constructed:
\[
N_{IC}(z^{-1}) = ic_{Cache-Dedicated},
\qquad
D_{IC}(z^{-1}) = 1 + 0.8 z^{-1} + 0.25 z^{-2}. \tag{5.16}
\]
Finally, based on all the design decisions made and the criterion equation (5.14), it is possible to reach a conclusion on the orders of the controller polynomials. As a result, the controller transfer function polynomial orders are
Algorithm 2 Deadbeat Controller Design Algorithm

1. For the given state-space models (3.23) and (3.30), the overall closed-loop transfer function in terms of the estimated plant parameters and the unknown controller parameters is obtained:
\[
\frac{IC(k)}{IC_{Cache-Dedicated}(k)}
= \frac{Q(z^{-1})\,\big(K A(z^{-1}) + CP\, B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\,\big(K A(z^{-1}) + CP\, B(z^{-1})\big)}, \tag{5.6}
\]
where \(A(z^{-1})\), \(B(z^{-1})\) refer to the cache access pattern (miss count) dynamic transfer function, \(P(z^{-1})\), \(Q(z^{-1})\) indicate the controller transfer function, CP is the cache miss penalty, and \(K = CPI_{Ideal} + Delay(k)\) is the sum of the ideal cycles per instruction for the thread and the memory access delay:
\[
\frac{D(z^{-1})}{C(z^{-1})} = CPI_{Ideal} + Delay(k) + CP\,\frac{B(z^{-1})}{A(z^{-1})}\,C_{mc}(k). \tag{5.7}
\]

2. From the closed-loop transfer function in Equation (5.6), a number of solvable algebraic polynomial equations, the so-called Diophantine equations, is derived:
\[
A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\,\big(K A(z^{-1}) + CP\, B(z^{-1})\big) = 1, \tag{5.8}
\]
where the dedicated cache reference signal \(IC_{Cache-Dedicated}\) is considered as the transfer function
\[
\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}. \tag{5.9}
\]
After deriving the first Diophantine equation (5.8), the tracking error of the closed-loop system can be formulated as:
\[
E(z^{-1}) = \Big[1 - \frac{Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}\Big]\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}, \tag{5.10}
\]
\[
= \Big[\big(1 - Q(z^{-1})\big)\big(K A(z^{-1}) + CP\,B(z^{-1})\big)\Big]\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}. \tag{5.11}
\]
Under the assumption that the division of \(1 - Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)\) by the polynomial \(D_{IC}\) also results in a polynomial \(S(z^{-1})\):
\[
S(z^{-1}) = \frac{1 - Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{D_{IC}(z^{-1})}, \tag{5.12}
\]
which leads to the last Diophantine equation:
\[
S(z^{-1})D_{IC}(z^{-1}) + Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = 1. \tag{5.13}
\]
3. Based on the degrees of the plant transfer function polynomials, which may be considered a design decision made at the parameter estimation stage, it is possible to determine the polynomial degrees of the controller transfer function as follows:
\[
\partial S(z^{-1}) = \max\big(\partial A(z^{-1}), \partial B(z^{-1})\big) - 1,\quad
\partial Q(z^{-1}) = \partial D_{IC}(z^{-1}) - 1,\quad
\partial Q(z^{-1}) = \partial A(z^{-1}) - 1,\quad
\partial P(z^{-1}) = \max\big(\partial B(z^{-1}), \partial A(z^{-1})\big). \tag{5.14}
\]

4. Using the facts given in Equation (5.14), rewrite the Diophantine equations (5.13) and (5.8) in polynomial notation.

5. Apply the uncertain coefficient method to derive the polynomial coefficients of the controller transfer function.
obtained as the 1st order polynomial \(Q(z^{-1})\) and the 2nd order polynomial \(P(z^{-1})\):
\[
Q(z^{-1}) = q_0 + q_1 z^{-1}, \tag{5.17}
\]
\[
P(z^{-1}) = p_0 + p_1 z^{-1} + p_2 z^{-2}, \tag{5.18}
\]
where the coefficients \(q_i\) and \(p_j\) are unknown. In fact, the algebraic controller design methods seek an expression for these coefficients in terms of the known plant and reference polynomials.
2. In this case, the Diophantine equation (5.8) is a good starting point for deriving an expression for the relationship between the plant and controller polynomials. The uncertain coefficient method is applied to the Diophantine equation (5.8) to express the controller polynomials in terms of the other, known plant and reference polynomial coefficients. Indeed, the uncertain coefficient method solves for an unknown polynomial by equating the coefficients of equal power terms. This approach leads to a system of linear equations whose solution corresponds to the controller polynomial coefficients. By rewriting the Diophantine equation (5.8) in terms of polynomial
coefficients, the following set of equations is derived:
\[
A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = 1,
\]
\[
(p_0 + p_1 z^{-1} + p_2 z^{-2})(a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1})\,z^{-1} K (a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1})\,z^{-1} CP (b_0 + b_1 z^{-1} + b_2 z^{-2}) = 1. \tag{5.19}
\]
It is more convenient for the uncertain coefficient method to express Equation (5.19) in expanded polynomial form; using the distributive property, the following expansion is obtained:
\[
p_0 a_0 + (p_0 a_1 + p_1 a_0)z^{-1} + (p_0 a_2 + p_1 a_1 + p_2 a_0)z^{-2} + (p_2 a_1 + p_1 a_2)z^{-3} + (p_2 a_2)z^{-4}
\]
\[
+\, K q_0 a_0 z^{-1} + (K q_0 a_1 + K q_1 a_0)z^{-2} + (K q_1 a_1 + K q_0 a_2)z^{-3} + (K q_1 a_2)z^{-4}
\]
\[
+\, CP q_0 b_0 z^{-1} + (CP q_0 b_1 + CP q_1 b_0)z^{-2} + (CP q_1 b_1 + CP q_0 b_2)z^{-3} + (CP q_1 b_2)z^{-4} = 1. \tag{5.20}
\]
The resulting linear system of equations can be expressed either in matrix form (5.21) or in algebraic form as a set of equations (5.22). Although the matrix form is only a compact representation of the algebraic one, it is convenient for computational purposes.
\[
\begin{bmatrix}
a_0 & 0 & 0 & 0 & 0 \\
a_1 & a_0 & 0 & (K a_0 + CP\,b_0) & 0 \\
a_2 & a_1 & a_0 & (K a_1 + CP\,b_1) & (K a_0 + CP\,b_0) \\
0 & a_2 & a_1 & (K a_2 + CP\,b_2) & (K a_1 + CP\,b_1) \\
0 & 0 & a_2 & 0 & (K a_2 + CP\,b_2)
\end{bmatrix}
\begin{bmatrix} p_0 \\ p_1 \\ p_2 \\ q_0 \\ q_1 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \tag{5.21}
\]
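Under our assumptions (a second order plant with scalar K and CP, as in the thesis; the helper names are ours), the matrix form reduces the deadbeat design to one small linear solve. A sketch:

```python
def solve_linear(M, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    A = [list(row) + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (A[i][n] - acc) / A[i][i]
    return x

def deadbeat_gains(a, b, K, CP):
    """Build the coefficient matrix of the Diophantine system and solve it;
    unknowns ordered [p0, p1, p2, q0, q1]."""
    g = [K * a[i] + CP * b[i] for i in range(3)]  # coefficients of K*A + CP*B
    M = [
        [a[0], 0.0,  0.0,  0.0,  0.0 ],
        [a[1], a[0], 0.0,  g[0], 0.0 ],
        [a[2], a[1], a[0], g[1], g[0]],
        [0.0,  a[2], a[1], g[2], g[1]],
        [0.0,  0.0,  a[2], 0.0,  g[2]],
    ]
    return solve_linear(M, [1.0, 0.0, 0.0, 0.0, 0.0])
```

This linear solve is the Θ(n³) Gaussian elimination step noted in the complexity analysis of the deadbeat algorithm.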
In algebraic form, the controller polynomial coefficients are derived by equating the coefficients of equal powers on both sides of Equation (5.20). As a result, the following set of equations is inferred:
\[
p_0 a_0 = 1,
\]
\[
p_0 a_1 + p_1 a_0 + K q_0 a_0 + CP\, q_0 b_0 = 0,
\]
\[
p_0 a_2 + p_1 a_1 + p_2 a_0 + K q_0 a_1 + K q_1 a_0 + CP\, q_0 b_1 + CP\, q_1 b_0 = 0,
\]
\[
p_2 a_1 + p_1 a_2 + K q_1 a_1 + K q_0 a_2 + CP\, q_1 b_1 + CP\, q_0 b_2 = 0,
\]
\[
p_2 a_2 + K q_1 a_2 + CP\, q_1 b_2 = 0. \tag{5.22}
\]
3. Although the matrix and algebraic forms of the uncertain coefficient method differ in approach, both give the same solution set for the controller transfer function coefficients. To provide a simpler presentation, a number of constants, given in Equation (5.23), are defined and used in the controller coefficient expressions:
\[
\begin{aligned}
H_1 &= (K a_2 + CP\, b_2) - (K a_0 + CP\, b_0)(a_2/a_0),\\
H_2 &= (K a_1 + CP\, b_1) - (K a_2 + CP\, b_2)(a_1/a_2),\\
H_3 &= (K a_1 + CP\, b_1) - (a_1/a_0)(K a_0 + CP\, b_0),\\
H_4 &= (K a_0 + CP\, b_0) - (a_0/a_2)(K a_2 + CP\, b_2),\\
H_5 &= -(1/a_0)\,a_2 - (1/a_0^2)\,a_1^2,\\
H_6 &= (1/a_0^2)\,a_2 a_1.
\end{aligned} \tag{5.23}
\]
\[
\begin{aligned}
q_0 &= (H_5 H_2 - H_4 H_6)/(H_3 H_2 - H_4 H_1),\\
q_1 &= (H_6 - H_1 q_0)/H_2,\\
p_0 &= 1/a_0,\\
p_1 &= -\big(a_1/a_0^2 - (K a_0 + CP\, b_0)(q_0/a_0)\big),\\
p_2 &= -\big((K a_2 + CP\, b_2)/a_2\big)\, q_1.
\end{aligned} \tag{5.24}
\]
To sum up, in this section we have derived the controller polynomial coefficients in terms of the estimated plant parameters and other known variables. This set of equations (5.24) eases the integration of numerical algebraic controller design into our cache aware adaptive closed loop scheduling framework. Although this method provides a fairly simple algebraic controller design strategy, its major drawback is the rigid closed-loop response with pre-set poles fixed at zero, which leads to very fast stabilization of the process output. To be more precise, fast stabilization requires control inputs with large peaks over a short period of time; in other words, the faster the stabilization, the higher the required actuator gain. In real applications this is not always feasible, due to saturation in physical components. Hence, the pole placement controller design algorithm is proposed as a widely applicable and flexible alternative.
Deadbeat Algorithm Complexity Analysis
For a fixed pole location at the origin, the deadbeat controller design algorithm is, in fact, the solution of a linear matrix equation Ax = b, given in Equation (5.21). Hence, the complexity of the deadbeat algorithm is Θ(n³) with the Gaussian elimination method, and up to Θ(nⁿ) with naive polynomial methods. Due to this algorithmic complexity, the system order is critical to the computational efficiency of the overall framework.
5.2.2 Pole Placement Controller Design
The pole placement controller design algorithm derives a set of controller characteristic polynomial coefficients for a pre-designed reference closed-loop system response and given time-varying plant polynomial coefficients. The main addition compared with deadbeat controller design is that the designer controls the closed-loop system response by specifying the locations of the poles of the closed-loop characteristic polynomial. In this way, the designer can ensure stability as well as the required characteristics of the closed-loop response, such as settling time, maximum overshoot and damping ratio. This does add complexity, since the designer must make design decisions on the closed-loop system behavior; on the other hand, it increases the flexibility and applicability of the approach to various control scenarios.
In our project, the following steps are carried out to apply the pole placement method to our controller design problem.
1. For the given state-space models (3.23) and (3.30), formulate the closed-loop transfer function:
\[
\frac{N_{G_{Des}}}{D_{G_{Des}}} = G_{Desired}(z^{-1})
= \frac{Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}, \tag{5.25}
\]
where \(A(z^{-1})\), \(B(z^{-1})\) refer to the cache access pattern (miss-count) dynamics transfer function, \(P(z^{-1})\), \(Q(z^{-1})\) are the controller transfer function polynomials, CP is the cache miss penalty, \(K = CPI_{Ideal} + Delay(k)\) is the sum of the ideal cycles per instruction for the thread and the memory access delay, and \(G_{Desired}(z^{-1})\) is the desired closed-loop transfer function response.
2. From the closed-loop transfer function (5.25), derive a number of solvable algebraic polynomial equations, the so-called Diophantine equations. Note that in the deadbeat controller case the poles were placed at z = 0, so that \(A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = 1\) in Equation (5.8). However, since the specific pole locations are now a design decision, the desired pole locations must also be considered in the formulation previously derived for the deadbeat controller design case. As a result, the Diophantine equation is formulated as:
\[
A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = D_{G_{Desired}}(z^{-1}). \tag{5.26}
\]
3. Derive the tracking error function \(E(z^{-1})\):
\[
E(z^{-1}) = IC_{Cache-Dedicated}(z^{-1}) - IC(z^{-1}), \tag{5.27}
\]
where \(IC_{Cache-Dedicated}(z^{-1})\) refers to the reference cache-dedicated instruction count, in our case \(\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}\), and \(IC(z^{-1})\) refers to the actual instruction count:
\[
E(z^{-1}) = \Big[1 - \frac{Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}\Big]\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}. \tag{5.28}
\]
Considering that the Diophantine equation is
\[
P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = D_{G_{Des}}(z^{-1}), \tag{5.29}
\]
and under the assumption that \(D_{IC}(z^{-1})\) divides the numerator of the tracking error transfer function (5.28), denoting this ratio by \(S(z^{-1})\), the second Diophantine equation is deduced:
\[
S(z^{-1}) = \frac{D_{G_{Des}}(z^{-1}) - Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{D_{IC}(z^{-1})},
\]
which leads to the Diophantine equation
\[
S(z^{-1})D_{IC}(z^{-1}) + Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = D_{G_{Des}}(z^{-1}). \tag{5.30}
\]
4. Considering the Diophantine equations (5.26) and (5.30) and the closed-loop transfer function (5.25), it is possible to reach a conclusion about the degrees of the polynomials. However, in contrast with the previous approach, the Diophantine equation (5.26) includes \(D_{G_{Des}}(z^{-1})\); this term adds complexity, particularly in determining the controller polynomial degrees \(\partial Q(z^{-1})\), \(\partial P(z^{-1})\) and the closed-loop transfer function degree. According to Bobal et al. [2005], the following criterion is considered:
\[
\partial D_{G_{Des}}(z^{-1}) \le \partial A(z^{-1}) + \partial \max\big(A(z^{-1}), B(z^{-1})\big) + 1 - 1. \tag{5.31}
\]
If this condition holds, then the same polynomial degree relations as in the deadbeat method (5.14) are also valid for the pole placement algorithm. If the condition is not met, \(\partial P(z^{-1})\) and \(\partial Q(z^{-1})\) cannot be uniquely determined. In our case, according to the QR-RLS parameter estimation algorithm, the polynomials of the system (thread execution pattern) plant are second order:
\[
\partial A(z^{-1}) = \partial B(z^{-1}) = 2. \tag{5.32}
\]
Hence, in order to meet the criterion given in Equation (5.31), \(\partial D_{G_{Des}}(z^{-1}) \le 4\) should hold. Under the assumption that the criterion is met, it is possible to calculate the polynomial degrees for the known plant polynomial degrees as:
\[
\partial A(z^{-1}) = 2, \tag{5.33}
\]
\[
\partial B(z^{-1}) = 2, \tag{5.34}
\]
\[
\partial P(z^{-1}) = 2, \tag{5.35}
\]
\[
\partial Q(z^{-1}) = 1. \tag{5.36}
\]
For these polynomials, the closed-loop polynomial degree is calculated using the closed-loop system polynomial (5.25):
\[
\partial D_{G_{Des}}(z^{-1}) = \max\big(\partial A(z^{-1}) + \partial P(z^{-1}),\; \partial A(z^{-1}) + 1 + \partial Q(z^{-1})\big). \tag{5.37}
\]
From Equation (5.37), it is clear that the closed-loop polynomial degree is four. Hence, it is now necessary to design the desired fourth order closed-loop response.
5. In the closed-loop pole assignment, the closed-loop poles are first designed in the continuous time domain, based on design decisions regarding the damping ratio, sampling period and oscillation frequency. In our project, the following design decisions on the closed-loop characteristics have been made:

Oscillation frequency \(\omega_n\sqrt{1-\varsigma^2}\): 275 kHz
Damping ratio \(\varsigma\): 0.54
Sampling period \(T_0\): 1.5e-06 sec
Maximum overshoot \(\Delta y\): 13.32 %

Using these design constraints, two complex conjugate poles and two real poles are derived:
\[
s_{1,2} = -9.33 \times 10^{5} \pm 1.45 \times 10^{6}\,i,\quad
s_3 = -1.0094 \times 10^{6},\quad
s_4 = -7.5962 \times 10^{5}. \tag{5.38}
\]
Here, please note that only the complex conjugate poles affect the damping and overshoot of the system; the real poles have no damping or oscillation impact on the closed-loop system response. Apart from the closed-loop response, the stability of the system should also be under the designer's consideration. As can be seen, all poles lie in the left half of the complex s-plane. As an internal validation, the impulse response of the designed closed-loop system is checked, and stability analysis is conducted on this system. Figure 5.2 indicates the continuous-time characteristic of the closed-loop system. In Figure 5.2, the step response of the designed closed-loop system is given; that is, how fast and how accurately the cache aware adaptive closed loop scheduling framework adapts to a change in the reference cache dedicated instruction count. As can be seen, the closed
Figure 5.2: Step Response of Closed Loop Transfer Function
loop system has a slow but accurate response to changes in the reference input. Here, accuracy refers to the overshoot percentage (13%), and speed of convergence refers to the settling time in Figure 5.2.
The stability of the closed-loop system is another important factor to be considered. In this regard, Figure 5.3 shows measures of robustness; in other words, how far the system is from instability. The first sub-figure from the left is a root locus plot, which indicates the system response against variation of the system gain. The other two sub-figures are Bode plots and refer to the robustness of stability. In the Bode plots, two terms are significant for our stability analysis. The first is GM, the gain margin, which is a measure of stability: GM is the additional controller gain, or any other system gain, that would lead to instability; in our design, stability is guaranteed up to a gain of 104 dB. The phase margin (PM) is associated with the robustness of the closed-loop stability; in other words, it measures the resistance of the closed-loop system against undesired oscillations. Hence, the larger the phase margin, the less the closed-loop system oscillates due to noise at the input or in the system. Our design has a phase margin of 114 degrees, which is well above the commonly recommended minimum of about 40 degrees. After validating the closed-loop performance and stability in the continuous domain for the designed poles, it is necessary to map the continuous poles to discrete-time poles using the exponential mapping.
Figure 5.3: Closed Loop Stability Analysis Bode Plots & Root Locus
Following the performance validation, it is necessary to map the continuous-time poles to the discrete-time domain. Although there are many different transformation techniques, such as the bilinear transformation, our project uses the simple exponential translation between the continuous and discrete time domains:

z_i = e^{s_i T_0},    (5.39)

where, as mentioned previously, T_0 refers to the sampling period and is equal to 1.5e-06 s in our system. As a result of this exponential mapping, the following discrete-time poles are derived:

z_{1,2} = -0.4 ± 0.3i,
z_3 = +0.22,
z_4 = +0.32.    (5.40)
6. Using the designed pole locations stated in Equation (5.40), the characteristic polynomial D_{GDes}(z^{-1}) of the closed loop system is calculated as:

D_{GDes}(z^{-1}) = 1 - 0.2596 z^{-1} - 0.0201 z^{-2} - 0.0131 z^{-3} + 0.0043 z^{-4},    (5.41)

where, for a general polynomial D_{GDes}(z^{-1}) = d_0 + d_1 z^{-1} + d_2 z^{-2} + d_3 z^{-3} + d_4 z^{-4}, the polynomial coefficients are:

d_0 = 1,
d_1 = -0.2596,
d_2 = -0.0201,
d_3 = -0.0131,
d_4 = 0.0043.    (5.42)
In the rest of the discussion, rather than using the numerical values of D_{GDes}(z^{-1}), the parameters given in Equation (5.42) are used to obtain a more general solution, applicable even to different closed loop system designs.
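The coefficients of a characteristic polynomial such as (5.41) follow mechanically from the chosen poles by expanding the product of the first-order factors (1 - z_i z^{-1}). A minimal generic sketch of that expansion; the two real roots in the demonstration are illustrative, not the designed poles of (5.40):

```python
def char_poly(poles):
    """Expand D(w) = prod_i (1 - z_i * w), with w = z^{-1}; returns [d0, d1, ...]."""
    coeffs = [1.0]
    for z in poles:
        nxt = coeffs + [0.0]            # multiply current polynomial by (1 - z*w)
        for i in range(1, len(nxt)):
            nxt[i] -= z * coeffs[i - 1]
        coeffs = nxt
    return coeffs

# Illustrative real roots (assumed for the demo):
d = char_poly([0.5, 0.25])
print(d)  # [1.0, -0.75, 0.125]
```

Complex-conjugate pole pairs, as in (5.40), combine to give real coefficients in the same way.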
7. Using the first Diophantine Equation (5.26) and the facts about the degrees of the polynomial characteristic functions (5.37), it is feasible to write the Diophantine Equation (5.26) in polynomial form:

(p_0 + p_1 z^{-1} + p_2 z^{-2})(a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1}) K z^{-1} (a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1}) C_P z^{-1} (b_0 + b_1 z^{-1} + b_2 z^{-2})
= d_0 + d_1 z^{-1} + d_2 z^{-2} + d_3 z^{-3} + d_4 z^{-4}.    (5.43)
8. The next step is to write the controller transfer function coefficients in terms of the plant transfer function coefficients. To achieve this, the Uncertain Coefficient Method is applied to the characteristic polynomial given in Equation (5.43). However, since the Uncertain Coefficient Method solves for the coefficients by equating equal powers of the polynomials, it is necessary to expand the polynomial system in Equation (5.43) as follows:

p_0 a_0 + (p_0 a_1 + p_1 a_0) z^{-1} + (p_0 a_2 + p_1 a_1 + p_2 a_0) z^{-2} + (p_2 a_1 + p_1 a_2) z^{-3} + p_2 a_2 z^{-4}
+ K q_0 a_0 z^{-1} + (K q_0 a_1 + K q_1 a_0) z^{-2} + (K q_1 a_1 + K q_0 a_2) z^{-3} + K q_1 a_2 z^{-4}
+ C_P q_0 b_0 z^{-1} + (C_P q_0 b_1 + C_P q_1 b_0) z^{-2} + (C_P q_1 b_1 + C_P q_0 b_2) z^{-3} + C_P q_1 b_2 z^{-4}
= d_0 + d_1 z^{-1} + d_2 z^{-2} + d_3 z^{-3} + d_4 z^{-4}.    (5.44)
9. At this stage, the Uncertain Coefficient Method produces a system of linear equations, which is in fact the solution to Equation (5.44). This system can be expressed either as a set of algebraic equations or in matrix form:

[ a_0    0      0      0                    0                 ] [p_0]   [d_0]
[ a_1    a_0    0      (K a_0 + C_P b_0)    0                 ] [p_1]   [d_1]
[ a_2    a_1    a_0    (K a_1 + C_P b_1)    (K a_0 + C_P b_0) ] [p_2] = [d_2]    (5.45)
[ 0      a_2    a_1    (K a_2 + C_P b_2)    (K a_1 + C_P b_1) ] [q_0]   [d_3]
[ 0      0      a_2    0                    (K a_2 + C_P b_2) ] [q_1]   [d_4]
p_0 a_0 = d_0,
(p_0 a_1 + p_1 a_0 + K q_0 a_0 + C_P q_0 b_0) = d_1,
(p_0 a_2 + p_1 a_1 + p_2 a_0 + K q_0 a_1 + K q_1 a_0 + C_P q_0 b_1 + C_P q_1 b_0) = d_2,
(p_2 a_1 + p_1 a_2 + K q_1 a_1 + K q_0 a_2 + C_P q_1 b_1 + C_P q_0 b_2) = d_3,
(p_2 a_2 + K q_1 a_2 + C_P q_1 b_2) = d_4.    (5.46)
Both of these representations of the set of equations provide an opportunity to express the unknown controller polynomial coefficients in terms of the estimated plant coefficients and the reference closed loop characteristic polynomial. The minor difference is that while the matrix form given in Equation (5.45) requires a matrix inversion, the expression in Equation (5.46) requires a series of algebraic operations.
10. As a result of the Uncertain Coefficient Method, the following set of expressions is obtained for the controller transfer function polynomial coefficients. Here, in order to express the controller polynomial coefficients in a more understandable form, the following constants are defined:

H_1 = (K a_2 + C_P b_2) - (K a_0 + C_P b_0)(a_2/a_0),
H_2 = (K a_1 + C_P b_1) - (K a_2 + C_P b_2)(a_1/a_2),
H_3 = (K a_1 + C_P b_1) - (K a_0 + C_P b_0)(a_1/a_0),
H_4 = (K a_0 + C_P b_0) - (K a_2 + C_P b_2)(a_0/a_2),
H_5 = d_2 - (d_0/a_0) a_2 - (a_1/a_0) d_1 + (d_0/a_0^2) a_1^2 - a_0 (d_4/a_2),
H_6 = d_3 - d_4 (a_1/a_2) - d_1 (a_2/a_0) + (d_0/a_0^2) a_2 a_1.    (5.47)
In fact, these constants in Equation (5.47) are used in the controller transfer function polynomial coefficients given below:

q_0 = (H_5 H_2 - H_4 H_6) / (H_3 H_2 - H_4 H_1),
q_1 = (H_6 - H_1 q_0) / H_2,
p_0 = d_0 / a_0,
p_1 = d_1/a_0 - d_0 a_1/a_0^2 - (K a_0 + C_P b_0)(q_0/a_0),
p_2 = d_4/a_2 - ((K a_2 + C_P b_2)/a_2) q_1.    (5.48)
Pole placement (assignment) addresses the drawbacks of the deadbeat method and gives flexibility to the designer in determining closed loop response characteristics such as damping ratio, overshoot, and settling time. This is a significant improvement, because in some systems it is hardly possible to control the closed loop response with high accuracy in a very short period of time due to controller or actuator limitations. In control terminology, this is identified as fast stabilization of the closed loop system, which leads to saturation in the actuator or the controller. In contrast, in the pole placement method it is always possible for the designer to decide on the degree of accuracy and the response time for particular application domains and specific systems. Hence, the pole placement algorithm is applicable to a wide range of applications. However, it is also inevitable that the complexity and computational cost of the pole placement method increase significantly from the designer's point of view, since closed loop pole design is involved.
Pole Placement Algorithm Complexity Analysis
For a pre-designed pole location, the pole placement controller design algorithm is, in fact, the solution of the linear matrix equation Ax = b given in Equation 5.45. Hence, the complexity of solving this linear matrix equation equals the complexity of the pole placement algorithm, which is Θ(n³) with the Gaussian elimination method and as high as Θ(nⁿ) with naive polynomial expansion methods. Due to this algorithmic complexity, the system order is very critical to the computational efficiency of the overall framework.
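A minimal sketch of that Θ(n³) solve: the matrix of Equation (5.45) is built from illustrative (assumed) plant values and solved by Gaussian elimination with partial pivoting.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting, Theta(n^3)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n                                  # back substitution
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Build the 5x5 system of (5.45) from illustrative (assumed) plant values:
a0, a1, a2 = 1.0, -0.5, 0.06
b0, b1, b2 = 0.2, 0.1, 0.05
K, CP = 1.0, 0.5
G0, G1, G2 = K*a0 + CP*b0, K*a1 + CP*b1, K*a2 + CP*b2
A = [[a0, 0.0, 0.0, 0.0, 0.0],
     [a1, a0,  0.0, G0,  0.0],
     [a2, a1,  a0,  G1,  G0 ],
     [0.0, a2, a1,  G2,  G1 ],
     [0.0, 0.0, a2, 0.0, G2 ]]
d = [1.0, -0.2596, -0.0201, -0.0131, 0.0043]       # coefficients of (5.41)
p0, p1, p2, q0, q1 = solve(A, d)

# Residual check: A x should reproduce d.
for row, di in zip(A, d):
    assert abs(sum(r*xi for r, xi in zip(row, (p0, p1, p2, q0, q1))) - di) < 1e-9
```

Note that the first row immediately gives p0 = d0/a0, in agreement with (5.48).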
5.3 Concluding Remarks
In this chapter, we have developed two alternative controller design methodologies with different complexity and applicability. Despite their differences, both approaches provide algebraic methods for finding the controller polynomial coefficients, or parameters, from given plant and system parameters. In other words, these algebraic design methods seek controller polynomial equations which drive the closed loop system toward the reference, or pre-set, ideal closed loop characteristic for a given plant characteristic polynomial at any given instant. In the context of our project, algebraic controller design is significant because our cache dynamics are highly unpredictable and show time-varying characteristics. Hence, it is necessary to estimate the plant polynomials at a given instant and to pass these plant
characteristics as polynomial coefficients to the algebraic control design methods. In this way, a specific controller is designed to keep the overall closed loop response at the ideal, pre-set characteristics. No matter what the cache performance is at a specific instant, the overall instruction count executed within that time will always be the same.
Chapter 6
Experimental Setup and Simulation
This chapter is an endeavor to validate the theoretical findings and derivations in line with our project aim and objectives. Namely, the question of whether the framework is able to meet the performance criteria and objectives is addressed in this chapter. To be more specific, it is necessary to clarify the research objectives.
1. Our main objective is to eliminate the processing inefficiency caused by shared cache resource allocation. In particular, the dependence of a thread's processing efficiency on the cache access patterns of its co-runners is targeted. Our framework is settled on the fact that cache misses on a shared cache memory are inevitable, because the shared cache is a limited memory unit shared among application threads with temporal cache memory demands. In such a cache environment, co-runner thread cache access behavior has a significant impact on a thread's performance. Hence, we propose that threads suffering high memory latency due to co-runner cache dependency can be compensated by allocating more processing cycles to them. In this way, we aim to improve completion time; that is, our main goal in this project is to minimize the performance degradation of a thread due to co-runner dependence on the shared resource by tracking a dedicated (reference) instruction cycle count. In this chapter, this argument is validated by comparing the instruction counts under a traditional scheduler and under our adaptive scheduler.
2. Our second objective is to provide a proactive framework rather than a reactive one. This capability requires prediction or estimation capability in the system. In this respect, our adaptive scheduling framework includes a parameter estimation module, which estimates the thread cache access pattern and thereby determines the proactive control action to minimize the instruction count error of the thread. In this chapter, this argument is validated by assessing the accuracy of the cache miss estimation.
6.1 Experiment Design
In order to provide a consistent performance assessment, it is necessary to develop test scenarios which provide consistent and comparable results. Ideal scenarios are able to reflect the actual performance of a particular algorithm, and their results should be comparable with those of other algorithms under similar test cases. In general, any simulation or test scenario is characterized by three types of constraints: controlled, fixed and free constraints.
The controlled constraints are parameters whose values are intentionally adjusted by the experimenter in order to reach a particular conclusion. In our experiment, the controlled parameter is the pair of co-runner threads or processes. In contrast to the controlled constraints, fixed constraints are those whose values are kept constant throughout the experiment. The main objective behind fixed constraints is to provide consistency across experiment results. As for free constraints, they are the ones which are not controllable by the designer and are generally assumed to have minor impact. In our experimental design strategy, two constraint types, fixed and controlled, are mainly used.
In our project context, controlled constraints aim to identify and observe different execution patterns or resource demands of threads or processes. In fact, each different execution pattern or resource demand corresponds to a particular adaptive scheduling scenario. Our adaptive scheduling framework is considered an efficient solution only if it can efficiently handle all these scenarios; in this case, controlled constraints provide a complete picture of the scheduling performance under different occurrences. Fixed constraints refer to the simulation platform settings such as cache memory size, number of cores, cache replacement policy, compiler type, and other computing platform settings. In our scenarios, the cache size, number of cores, cache replacement policies, and core and cache architecture are kept fixed. In addition, our experiments use compiled SPEC CPU2000 benchmark binaries; that is, the sample applications are compiled using the same specific instruction sets and converted into binary/hardware code. In this way, software platform dependence on performance is eliminated. All in all, fixed constraints provide consistency among experiments and allow only the controlled constraints to affect the result of the experiment. That is, the experiment designer keeps all constraints except the controlled ones constant in order to observe the impact of those particular constraints on the experiment.
Considering these experiment design elements, a number of experiments will be designed
to assess the performance characteristics of our scheduling framework. The following section
includes the development of experiments, and the rest of the chapter analyzes and evaluates the
experiment outputs.
6.2 Experiment Scenarios
6.2.1 Development of Experiment Constraints
In this section, the controlled and fixed constraints briefly mentioned in the previous section are discussed in detail in our project context.
Controlled Constraints
As mentioned previously, controlled constraints refer to the variant factors in the experiment. In fact, the efficiency of the framework is measured with respect to these constraints. In our experiment, co-runner pairs are considered the controlled constraints. Because each thread has a different memory and execution pattern, each co-runner thread pair exhibits a different system behavior; from the adaptive scheduling perspective, this implies a different identification complexity, a different control action and a different reference signal value. Hence, each controlled pair of co-runner threads validates one particular aspect of the system. In line with our project context, we create two sets of co-runner thread pairs according to cache resource demands:
1. Heavyweight Co-Runner Threads This pair refers to threads having a high-volume cache demand and showing non-temporal, streaming cache behavior. Streaming applications, such as video coding, belong to this class of threads. For instance, the SPEC CPU2000 integer benchmark mcf requires up to 1.7 GB of memory, and swim also requires a cache bigger than 16 MB. Hence, the swim-mcf pair can be considered a heavyweight co-runner pair. For heavyweight co-runner threads, the cache performance of these threads has a significant impact on overall thread performance. Hence, this class of co-runner threads is significant in our cache-aware adaptive scheduling experiments, and can even be considered an application domain for our cache-aware adaptive framework.
2. Lightweight Co-Runner Threads This pair refers to a pair composed of two threads with different cache requirements. While the first thread of the pair has a significant cache demand, the second does not have a significant cache resource demand and may show high temporal locality. As a result, the second thread's cache performance hardly impacts the first thread or the overall system performance.
To sum up, controlled constraints define the system performance under variant conditions on a particular platform, which enables a more reasonable analysis.
Fixed Constraints
Fixed constraints refer to the invariant factors throughout the experiment. In our experiment, multi-core architecture elements such as the cache replacement policy, the CPU architecture (such as simple scalar or superscalar), the cache memory size and many other architectural elements are fixed constraints. Although these factors might also have different impacts on the processor performance of different threads, these impacts are omitted because our research investigates the negative impact of co-runner-dependent cache resource allocation on thread processing efficiency, rather than architectural differences and their impacts on processor performance. Apart from the processor architecture, there might also be compiler-related factors involved in the experiment output. Since our experimental approach is an endeavor to provide a fair global comparison among existing scheduling approaches, it is inevitable to use a standard experimental setup. In our case, the SPEC CPU2000 benchmarking setup is preferred. It rules out discrepancies due to minor differences among compiler structures, software thread switching, and other software platform related differences. The SPEC CPU2000 benchmarking setup includes a set of
argument files and binary source code. Indeed, benchmarking is a controlled experiment strategy which allows the experimenter to make one-to-one comparisons with other experiment outputs.
All in all, our aim as experimenters is to provide a much clearer and simpler picture by fixing some constraints and considering only a few main ones. In addition, benchmarking strategies enhance our ability to analyze our experiment outputs and compare them with conventional algorithms.
6.2.2 Design Strategy: Two-Stage Experiment
In this section, we provide a series of different scenarios for our experiment setup. Once again, it is worth mentioning that benchmarking tools are utilized for the sake of consistency, better analysis and comparison with existing approaches and algorithms. Our experiment is a two-stage experiment, as shown in Figure 6.1; that is, statistics for a thread and its co-runner pair, such as miss count and instruction count, are retrieved using the M-SIM simulator and then fed into the MATLAB computing platform. Namely, while M-SIM is a multi-core multiprocessor simulator generating cache and instruction statistics, the adaptive control framework developed on the MATLAB computing platform is utilized to analyze the performance of the overall adaptive scheduling framework.
M-SIM: With an additional time-series statistic collector module, M-SIM generates time-series statistics of the cache miss and instruction count metrics of a thread and its co-runner threads.
Adaptive Self-Tuning Framework (MATLAB): For the collected time-series statistics, the adaptive self-tuning framework enforces a fair allocation of instruction count to a thread by adjusting the processor cycles of co-runner threads. The experiment outputs validate the efficiency of the overall framework, namely how accurate and robust the overall framework is for a given thread pair.
6.3 Experiments
In our research, considering the constraints and strategies referred to above, a series of experiments is conducted. These experiments aim to assess the performance of the cache aware adaptive closed loop scheduling framework while underlining the advantages and drawbacks of the framework under different sets of constraints. This section is outlined in three subsections: experiment setup, experiment outcomes and analysis.
6.3.1 Experiment Setup
In this experiment, the cache aware adaptive closed loop scheduling framework, which is composed of an adaptive self-tuning control framework and a multi-core multiprocessor scheduling framework, is considered the target scheduling framework. For a fixed set of multiprocessor constraints and controlled co-runner pairs, our experiment setup assesses the performance of the adaptive scheduling framework.
Firstly, the multi-core multiprocessor platform related constraints are specified as in Table 6.1, and for that particular platform configuration and the selected co-runner thread pair, time-series miss count and instruction count statistics are retrieved using the M-SIM simulation platform. Table 6.1 shows only a limited number of constraints, those which have a direct impact on the performance of the cache aware adaptive multiprocessor scheduling framework. The M-SIM simulation gives time-series statistics of the co-runner thread pair; more specifically, the instruction count vector (ic[k]) of the thread, the miss count vector (mc[k]) of the thread, the miss count vector (cmc[k]) of the co-runner pair, and the instruction count (ic_{Cache-Dedicated}[k]) of the thread on a dedicated cache platform, as shown in Figure 6.1.
As the collected time-series vectors are imported into the MATLAB workspace, MATLAB m-files are executed according to the logic given in Figure 6.2. For the imported time-series statistics, the main module mainSTRPP2.m calls the initdef.m function to initialize the system variables. After the initialization, the plant parameters are recursively identified using the QR-RLS module identnorder.m, and the corresponding controller parameters are designed using the pole placement module PolePP.m; these two modules are called by the main module recursively until the end of a simulation run. Meanwhile, the plant parameters and controller parameters are recursively updated and recorded. In this way, the system performance indicators are continuously monitored; they are analyzed in the following section, Experiment Outcomes. The modules mainSTRPP2.m, identnorder.m and PolePP.m in Figure 6.2 are implementations of the cache aware adaptive scheduling framework described in Chapter 3, the QR Recursive Least Squares online deterministic algorithm given in Chapter 4, and the algebraic pole placement method given in Chapter 5, respectively.
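The recursive identify-then-design loop of Figure 6.2 can be sketched as follows. This is a hypothetical Python stand-in, not the thesis code: a plain recursive least squares (RLS) update replaces the QR-RLS of identnorder.m, a synthetic noise-free first-order plant replaces the M-SIM statistics, and the controller design step (PolePP.m) is indicated only by a comment.

```python
def rls_step(theta, P, phi, y, lam=1.0):
    """One recursive least squares update for a 2-parameter model y = phi' theta."""
    Pphi = [P[0][0]*phi[0] + P[0][1]*phi[1],
            P[1][0]*phi[0] + P[1][1]*phi[1]]
    denom = lam + phi[0]*Pphi[0] + phi[1]*Pphi[1]
    K = [Pphi[0]/denom, Pphi[1]/denom]            # gain vector
    err = y - (phi[0]*theta[0] + phi[1]*theta[1]) # prediction error
    theta = [theta[0] + K[0]*err, theta[1] + K[1]*err]
    P = [[(P[0][0] - K[0]*Pphi[0])/lam, (P[0][1] - K[0]*Pphi[1])/lam],
         [(P[1][0] - K[1]*Pphi[0])/lam, (P[1][1] - K[1]*Pphi[1])/lam]]
    return theta, P

# Identify a first-order plant y[k] = a*y[k-1] + b*u[k-1] from noise-free data.
a_true, b_true = 0.5, 1.0
theta = [0.0, 0.0]                       # initial estimate of [a, b]
P = [[1000.0, 0.0], [0.0, 1000.0]]       # large initial covariance
y_prev, u_prev = 0.0, 1.0
for k in range(1, 200):
    y = a_true*y_prev + b_true*u_prev    # "plant" output (stand-in for M-SIM data)
    theta, P = rls_step(theta, P, [y_prev, u_prev], y)
    # ...a controller design step (cf. PolePP.m) would use theta here...
    y_prev, u_prev = y, 1.0 if (k % 3) else -1.0  # persistently exciting input
print(theta)  # converges near [0.5, 1.0]
```

The period-3 input keeps the regressors linearly independent; with a constant or purely alternating input, the steady-state regressors become collinear and the two parameters cannot be separated.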
Table 6.1: Multi-Core CMP Architecture Constraints

L2 Data Cache: Number of Sets 512; Block Size 64; Associativity 16; Replacement Policy LRU; L2 Penalty (L3 Latency + Physical Memory Latency) 30 + 60 = 90 cycles
L1 Data Cache: Number of Sets 512; Block Size 64; Associativity 4; Replacement Policy LRU
L1 Instruction Cache: Number of Sets 512; Block Size 64; Associativity 2; Replacement Policy LRU
SMT Multi-Core Processor: Number of Cores 4; Max Contexts (Threads) per Core 3; Superscalar Architecture (Instruction BW) 8 instructions/cycle
Figure 6.1: Experiment Setup
Figure 6.2: MATLAB Framework Simulation Flow
6.3.2 Experiment Outcomes
According to the experiment setup given in the previous section, for each of the SPEC CPU2000 benchmark thread pairs given in Table 6.3, the performance indicators of the simulation outcomes of the cache aware adaptive scheduling framework are derived.
The SPEC CPU2000 benchmarking suite offers a number of workloads, each of which has different memory access characteristics and execution patterns. For each chosen co-runner thread pair in Table 6.2, multi-core CMP and adaptive scheduling framework simulations are conducted. The simulation outcomes are summarized in Table 6.3 per co-runner thread pair.
In the experiment, we have defined two classes of co-runner thread pairs, heavy-weight and light-weight. As mentioned previously, while light-weight co-runner pairs demonstrate comparably less cache dependent performance degradation, each of the heavy-weight pairs shows a high cache requirement; as a result, cache dependent performance degradation is inevitable for these pairs. In our experiment, the first thread of each pair always refers to the thread with a high cache requirement, such as mcf, gcc, applu or mesa. The second thread is selected as a heavy-weight thread in one pair and a light-weight thread in the following pair. In this way, the performance variability of the adaptive scheduling is investigated for both light-weight and heavy-weight co-runner pairs. Table 6.2 describes the benchmark threads, or workloads, and their applications.
Our experiments reveal that for the selected co-runner pairs, it is possible to improve the average instruction count of the specified thread so that any cache memory related processing delay is compensated. The achieved improvement refers to how close the cache aware adaptive closed
Table 6.2: Workload/Thread Definition & Scope [KleinOsowski and Lilja, 2002]

Thread | Definition | Type/Component
art    | image recognition / neural network thread | Floating Point
mcf    | combinatorial optimization thread | Integer
gcc    | C compiler | Integer
applu  | parabolic/elliptic partial differential equation solver | Floating Point
mesa   | 3D graphics library thread | Floating Point
vpr    | FPGA circuit placement and routing thread | Integer
crafty | game playing (chess) thread | Integer
Table 6.3: Adaptive Scheduling Framework Experiment Results

Co-Runner Pair | Avg. Miss Count (Conventional) | Avg. IC (Conventional) | Avg. Estimated Miss Count | Avg. IC (Adaptive) | Avg. Dedicated IC
mcf & art     | 35    | 331  | 48    | 551  | 654.54
mcf & vpr     | 34    | 590  | 42    | 578  | 654.54
gcc & art     | 5     | 271  | 4.50  | 375  | 615.55
gcc & crafty  | 3.79  | 557  | 3.38  | 525  | 615.55
applu & art   | 6     | 721  | 5.082 | 1138 | 1196
applu & vpr   | 7     | 1137 | 6.04  | 1198 | 1196
mesa & applu  | 23.82 | 397  | 32    | 933  | 757.15
mesa & crafty | 12    | 704  | 12.61 | 1135 | 757.15
loop scheduling framework brings the instruction count in the shared cache environment (multiple threads) to the instruction count in the dedicated cache environment (single thread). According to Table 6.3, the framework ensures that the instruction count of a thread in a shared cache environment converges to its instruction count in the dedicated cache environment by allocating extra execution slots to the thread. However, as might be observed, the instruction count increase is not the same for all threads, and in some cases there is even a small degradation in the instruction count of the thread. In fact, the actual instruction count distribution with respect to time indicates the degree of persistency. The behavior of the reference instruction count and the actual instruction count with respect to time impacts the feasibility of the cache aware adaptive closed loop scheduling framework and the accuracy of the experiment outcomes. These issues are discussed in detail in the Analysis and Evaluation of Experiments section.
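As a back-of-envelope illustration of this convergence, the relative degradation with respect to the dedicated-cache instruction count can be computed from the Table 6.3 values of the mcf & art pair. The metric below is an illustrative summary of the table, not a statistic the thesis itself reports.

```python
def degradation(avg_ic, dedicated_ic):
    """Relative performance degradation w.r.t. the dedicated-cache IC."""
    return 1.0 - avg_ic / dedicated_ic

# mcf & art row of Table 6.3:
conv = degradation(331, 654.54)   # conventional scheduler
adap = degradation(551, 654.54)   # adaptive scheduler
print(round(conv, 3), round(adap, 3))  # 0.494 0.158
```

For this pair, the adaptive scheduler thus cuts the co-runner-induced degradation from roughly 49% to roughly 16% of the dedicated-cache instruction count.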
6.3.3 Analysis and Evaluation of Experiments
In this section, a detailed analysis of the experiment outcomes is conducted. Each benchmark result given in the previous section is supported with a time series analysis. The main objective is to determine the feasibility, practicality, weaknesses and strengths of the cache aware adaptive closed loop scheduling framework. In this respect, the instruction count and miss count time series regression statistics of the conventional scheduler and the adaptive scheduling framework are retrieved. Then, the instruction count statistics are compared with the reference instruction count statistics, which are gathered for the same thread under a dedicated cache environment.
mcf-art Co-Runner Pair
This co-runner pair is identified as a heavy-weight co-runner pair. Hence, as stated in Table 6.3, a high miss count and a considerable difference between the reference (dedicated cache) and actual (shared cache) instruction counts are anticipated. In such a scenario, we expect our adaptive framework to be able to improve the instruction count of a particular thread among the co-runner threads; in this case, mcf is the target thread of this co-runner pair. In Figure 6.3, the cyan dots represent the actual instruction count data points under the conventional scheduler, whereas the magenta line plot is the adaptive instruction count, which indicates the mcf computing performance under our adaptive scheduling framework. According to Figure 6.3, the instruction count performance is increased significantly over time. This can also be verified in Table 6.3.
Figure 6.3: Adaptive Instruction Count and Actual Instruction Count
Figure 6.4: Adaptive Instruction Count and Reference Instruction Count
As mentioned previously, our main goal in the cache aware adaptive closed loop scheduling framework is to drive the instruction count of the target thread towards the reference instruction count, considered as the thread's instruction count in a dedicated cache environment. Figure 6.4 indicates that the adaptive instruction count, the red line plot, converges towards the mean of the reference instruction count statistics, the cyan data points. This behavior can also be verified in the statistics given in Table 6.3.
Figure 6.5: Adaptive Miss Count and Actual Miss Count
Not only the instruction count error but also the cache miss count is considered for the reference instruction count tracking. The controller action for an instruction tracking error is reconstructed on each cycle based on the estimated cache miss statistics. It is worth mentioning once again the relation between cache misses and instruction count in our framework. In our framework, the instruction count tracking error is compensated with a number of processor execution cycles. In other words, the controller produces a controller output, the processor execution cycles corresponding to the instruction count tracking error. The instruction count equals the number of cycles divided by the cycles per instruction (CPI), where the CPI is the ideal CPI plus the stalls due to cache misses. Hence, in such a framework, the accuracy of the cache miss prediction is critical. Figure 6.5 illustrates the performance of this estimation process. According to Figure 6.5, despite spikes in the transient state of the thread execution period (approximately 150 cycles out of 1500), the estimation converges to the actual miss count distribution. Here, the cyan points refer to the actual miss counts while the red plot is the miss count estimation. In fact, except for 10 data points out of 1500, the estimation looks reasonable. As a result, we can conclude that the processor cycle to instruction count translation in our framework is accurate enough to reflect the actual system performance.
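The cycle-to-instruction translation described above can be sketched with a simplified model. All numbers below are illustrative assumptions (an ideal CPI of 0.5 and a 90-cycle miss penalty echoing the L2 penalty of Table 6.1), not measured values.

```python
def instructions_from_cycles(cycles, ideal_cpi, misses, miss_penalty):
    """Instructions completed in a cycle budget: stall cycles from cache
    misses are removed first, the rest execute at the ideal CPI."""
    useful_cycles = cycles - misses * miss_penalty
    return useful_cycles / ideal_cpi

# Illustrative: a 1000-cycle budget, ideal CPI of 0.5 (superscalar),
# 5 misses at a 90-cycle penalty (cf. the L2 penalty in Table 6.1).
ic = instructions_from_cycles(1000, 0.5, 5, 90)
print(ic)  # 1100.0
```

The sketch makes the controller's sensitivity visible: an error in the estimated miss count shifts the predicted instruction count by miss_penalty/ideal_cpi instructions per miss, which is why the estimation accuracy shown in Figure 6.5 matters.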
Figure 6.6: Reference Instruction Count and Actual Instruction Count
So far, we have discussed the accuracy of our framework for a given co-runner thread pair; however, it is also necessary to consider the feasibility of the cache aware adaptive closed loop scheduling framework for a given co-runner pair. In this case, the ratio of the reference instruction count to the actual instruction count can be an indicator. For the mcf-art pair, Figure 6.6 illustrates that the reference instruction count distribution, which refers to the instruction count statistics in a dedicated cache environment, is significantly larger than the actual instruction count. This, in fact, points out that there are significant cache misses and, as a result, cache stalls, which degrade the cycles per instruction and the instruction count. Indeed, this analysis provides an initial guess as to the applicability and feasibility of our framework for this co-runner pair.
mcf-vpr Co-Runner Pair
In this scenario, the co-runner of the target thread mcf is replaced with the light-weight thread vpr. This scenario underlines the performance limitations of the cache aware adaptive closed loop scheduling framework for light-weight co-runner pairs. To start with, Figure 6.7 illustrates the actual and reference instruction count time series distributions. In contrast to the heavy-weight co-runner instance given in Figure 6.6, the actual instruction count is very close to the reference instruction count. In other words, there is a significant limitation for our adaptive scheduling framework, since the instruction tracking error is already close to zero. Hence, the framework has a negligible control effort, and so no major contribution. This fact can be validated through Table 6.3.

Figure 6.7: Reference Instruction Count and Actual Instruction Count
Figure 6.8: Adaptive Instruction Count and Actual Instruction Count
To be more specific, Figure 6.16 illustrates that in fact adaptive instruction count is equal to
and even smaller than the actual instruction count. The reason the instruction count is below the
actual one is considered as the result of the estimation/approximation process in our framework,
which in fact takes the mean of the actual process. In Figure 6.16, adaptive instruction count
6.3. EXPERIMENTS 127
plot passes through the middle of the actual data points. Although our adaptive
scheduling framework drives the adaptive instruction count toward the reference instruction
count, as shown in Figure 6.9, the small tracking error between the actual and the
reference instruction count reduces the control effort of our cache aware adaptive closed
loop scheduling framework.
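The limitation described here, that a near-zero tracking error yields a near-zero control effort, can be illustrated with a minimal proportional sketch; the gain and the sample series below are hypothetical, not values from the thesis experiments:

```python
# Minimal sketch: control effort shrinks with the instruction tracking error.
# K_P is a hypothetical proportional gain mapping instruction-count error to
# extra processor cycles; the sample series are illustrative only.
K_P = 0.5

def control_effort(reference, actual):
    """Return per-interval extra-cycle allocations for a tracking error."""
    return [K_P * (r - a) for r, a in zip(reference, actual)]

# Heavy-weight co-runner: actual count lags the reference, so effort is large.
heavy = control_effort([100, 100, 100], [60, 55, 65])
# Light-weight co-runner (vpr-like): actual is already close to the reference,
# so the framework's contribution is negligible.
light = control_effort([100, 100, 100], [99, 98, 100])

assert sum(heavy) > sum(light)
```

For a heavy-weight co-runner the accumulated effort is large, while for a vpr-like light-weight co-runner it is negligible, matching the behavior observed in this scenario.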
Figure 6.9: Adaptive Instruction Count and Reference Instruction Count
As Figure 6.10 shows, the miss count estimation performance of the adaptive scheduling framework
is satisfactory.
Another significant observation about these two co-runner pairs and the target thread mcf is that,
although the cache miss counts of the target thread mcf in both the light-weight and heavy-weight
co-runner pairs are the same, the actual instruction counts differ. This indicates that in the
heavy-weight co-runner case both threads are cache demanding; in other words, the number
of cache slots allocated per thread is low. Hence, even though the cache miss count is the same as in the
light-weight co-runner pair, the instruction count is in fact limited by the size of the memory resources
allocated per thread. This case also underlines the need to complement additional processor cycle
allocation with extra cache memory blocks.
In the two cases discussed so far, the target thread instruction count and cache miss charac-
teristics have been illustrated for two different co-runner pairs, light-weight and heavy-weight. In
other words, target thread performance has been analyzed with respect to variant co-runner pairs:
light-weight and heavy-weight threads.
Figure 6.10: Adaptive Miss Count and Actual Miss Count
The same conclusions and characteristics apply to the other co-runner pairs given in Table 6.3.
However, the characteristics of the mesa thread look unusual irrespective of the co-runner pair.
To emphasize and investigate this, the instruction count and cache miss characteristics of mesa,
and of another target thread, gcc, are investigated for the fixed light-weight co-runner crafty.
mesa-crafty Co-Runner Pair
According to Table 6.2, mesa is a 3D library thread and crafty is a game-playing thread, particularly
for artificial intelligence games. To begin with, Figure 6.11 indicates that the actual instruction
count statistics are very close to the reference instruction count, since crafty is a
light-weight co-runner. However, both the actual and the reference instruction count
characteristics of mesa indicate a distinct execution pattern with idle and active
periods, as shown in Figure 6.11.
While a significant instruction count is observed in the active period, no activity, or a significant
drop in instruction count, is observed during the passive period. As a result of these
fast transitions between active and passive states, our algorithm fails to track the reference
instruction count and actually produces a significantly larger adaptive instruction count, as
shown in Figure 6.13. In summary, as a result of the rapid transitions between the passive state
instruction count and the active state instruction count of the thread mesa, as shown in Figure
6.12, the adaptive instruction count is aggressively increased by the framework. This leads
to an extra instruction allocation around 25% greater than the reference instruction count.
Figure 6.11: Reference Instruction Count and Actual Instruction Count
Figure 6.12: Adaptive Instruction Count and Actual Instruction Count
As for
Figure 6.13: Adaptive Instruction Count and Reference Instruction Count
the cache miss count estimation, Figure 6.14 illustrates a similar characteristic as in the previous
cases. Despite a few outlying points, the adaptive (estimated) miss count tracks the
actual miss count measurements.
In summary, due to the highly nonlinear instruction count pattern of the thread mesa, at some
vertical transition points, where the derivative approaches infinity, the cache aware adaptive closed
loop scheduling framework responds so aggressively that the adaptive instruction count jumps over
the reference instruction count. This highly nonlinear transition in instruction count can best be
explained, in the context of a 3D rendering application, as the transition from steady visual elements
to highly active visual elements and textures. For instance, rendering peak highway traffic
could be an intensive load for mesa, whereas rendering a static scene or view has a very
low computational resource requirement. Hence, this scenario indicates that there may be
cases where the framework has a tracking error of up to 25%, and underlines the fact that
framework performance also depends on the persistency, or character, of the input data, which in
our case is the reference instruction count.
Figure 6.14: Adaptive Miss Count and Actual Miss Count
gcc-crafty Co-Runner Pair
In contrast to mesa-crafty, gcc-crafty has a similar instruction count characteristic to the rest of
the light-weight co-runner pairs. Due to the light-weight co-runner crafty, the reference and actual
instruction count distributions of the target thread gcc are quite close
to each other, as shown in Figure 6.15. This indicates that the actual tracking error is close to
zero.
Because the instruction count error is close to zero, our adaptive scheduler's effort
to drive the actual instruction count to the reference instruction count is limited; as a result, the
adaptive instruction count is smaller than or equal to the actual instruction count, as shown in
Figure 6.16. This is also indicated by Table 6.3.
Although the cache aware adaptive closed loop scheduling framework tracks the
reference instruction count, as shown in Figure 6.17, in this instance the actual
instruction count of the target thread is so close to the reference instruction count that our
adaptive scheduling framework's effort is negligible. In other words, the instruction count error,
the difference between the reference and the actual instruction count, is close to zero; as a result,
the corresponding controller output, execution time slots (processor cycles), will be considerably
small. In such a circumstance, the cache aware adaptive closed loop scheduling framework's impact
is negligible. As mentioned previously in many instances, for a light-weight co-runner
pair our framework's influence on thread performance is limited.
Figure 6.15: Reference Instruction Count and Actual Instruction Count
Figure 6.16: Adaptive Instruction Count and Actual Instruction Count
Figure 6.17: Adaptive Instruction Count and Reference Instruction Count
As for the cache pattern (miss count) estimation, similar estimation capabilities are observed,
as illustrated in Figure 6.18. This confirms that the cache aware adaptive closed loop
scheduling framework not only tracks the instruction count with respect to the reference
but also takes cache statistics into account during this process. In a practical setting, the cache
miss pattern is considered because only processor cycles can be utilized as a controller output
by the processor; that is, for a calculated instruction count error, the corresponding processor
cycles are allocated through the thread's cache access pattern. This is mainly because
the processor itself is unaware of a thread's resource requirements until the thread is actually
executed. Thus, the controller output should be an entity independent of thread execution behavior.
6.4 Concluding Remarks
In summary, our cache aware adaptive closed loop scheduling framework can be considered as an
enforcement mechanism, which ensures that the target thread instruction count tracks the reference
instruction count, which in our case is the instruction count in a dedicated computing environment.
Figure 6.18: Adaptive Miss Count and Actual Miss Count
Apart from this tracking capability, the framework also considers the cache pattern of the thread
during this process, providing proactive rather than reactive control. For a calculated
tracking error, the controller calculates the number of processor cycles (the controller output)
required to drive this error to zero, taking the cache pattern of the target thread into account.
The conversion from instruction count tracking error to processor cycles is necessary because
the processor is only capable of allocating static resources, rather than dynamic ones
such as instruction counts, which depend on each thread's cache characteristics. Our adaptive
scheduling controller performs this mapping based on the estimated cache pattern of the target
thread; the processor then allocates the controller output (processor cycles) as requested by the
controller. Using this framework, a series of experiments was conducted following the experiment
design described in Section 6.1. These experiments indicate three different cases across the
variant co-runner thread pairs.
1. For heavy-weight co-runner thread pairs, our adaptive scheduling is successful in driving
the instruction count of the target thread toward the reference instruction count.
2. For a light-weight co-runner thread pair, our adaptive scheduling performance depends
on how close the actual instruction count is to the reference instruction count. In
this experiment, the reference instruction count is the target thread's instruction
count in a dedicated cache environment. In this case, our cache aware adaptive closed
loop scheduling framework can be considered a passive component in the processing
platform.
3. For a highly nonlinear and unstable target thread such as mesa, our scheduling framework
might have an error margin of up to 25 percent. This is entirely dependent on the
co-runner pair instruction count statistics; depending on the persistency and consistency
of the thread statistics, this error margin can vary from 1 percent to 25 or even 50 percent.
However, our experiments indicate that our cache aware adaptive closed loop scheduling
framework performs reasonably well for most of the threads.
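The conversion from instruction count tracking error to processor cycles, summarized above, can be sketched as follows; the instructions-per-cycle model and its constants are illustrative assumptions, not the identified model of the thesis:

```python
# Sketch of the tracking-error -> processor-cycle mapping. The processor can
# only grant static resources (cycles), so the instruction-count error must be
# converted using the thread's estimated cache pattern. The linear IPC model
# and its constants are illustrative assumptions, not thesis values.
BASE_IPC = 2.0          # instructions per cycle with no cache misses
MISS_PENALTY = 0.02     # hypothetical IPC loss per unit of estimated miss rate

def cycles_for_error(instr_error, est_miss_rate):
    """Map an instruction-count tracking error to extra processor cycles."""
    effective_ipc = max(BASE_IPC - MISS_PENALTY * est_miss_rate, 0.1)
    return instr_error / effective_ipc

# A cache-heavy thread (high estimated miss rate) needs more cycles to close
# the same instruction-count error than a cache-light one.
cache_heavy = cycles_for_error(1000, est_miss_rate=80)
cache_light = cycles_for_error(1000, est_miss_rate=5)
assert cache_heavy > cache_light
```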
As mentioned above, one of the significant contributions of this framework is in estimating
and accounting for thread cache patterns, which improves the robustness and practicability
of the framework. According to our experimental results, despite a small number of spikes and
jumps, cache pattern estimation is an effective component of the overall framework. All in
all, the experimental outcomes indicate that the cache aware adaptive closed loop scheduling
framework is a reasonable solution, which considers the cache patterns of the threads and
compensates for the performance degradation caused by inefficient cache allocation with an
additional allocation of processor cycles.
Chapter 7
Conclusions and Recommendations
7.1 Summary
This thesis has constructed a cache aware adaptive closed loop scheduling framework,
which merges multidisciplinary research areas, including modern control theory
and computer systems. In line with the objectives and contributions stated in the introduction,
two major and three minor achievements can be underscored:
• First of all, the execution fairness algorithm has successfully been applied to the cache
aware adaptive scheduling framework. As stated in the introduction, this
algorithm makes the actual thread instruction count converge to the cache dedicated
(reference) instruction count. In this way, the dependency of thread performance on the
co-runner has been eliminated.
• A thread execution model has successfully been developed. This state-space model
provides a time series formulation of the coupled instruction count and cache resource
dynamics of the thread and its co-runner threads. These sets of equations describe
the thread execution and cache access patterns.
• Based on the developed thread execution model, the QR Recursive Least Squares
(QR-RLS) parameter estimation algorithm has successfully been applied to estimate the
instantaneous cache pattern of the thread. To achieve this, a regression cache miss model
has been developed using the time series statistics of the miss count and co-runner miss
count metrics. The QR-RLS algorithm provides a set of polynomial coefficients representing
the cache behavior of the thread at that particular time for the given statistics.
• Using the estimated cache access and execution patterns of the thread, an algebraic
controller design algorithm (pole placement) has successfully been applied to the adaptive
self-tuning control framework, which forces the actual instruction count to track the cache
dedicated (reference) instruction count irrespective of the cache patterns of the co-runner
and the actual thread. In the pole placement algebraic design algorithm, the designer has
flexibility in determining the closed loop system response; in our case, a stable cache
aware adaptive scheduling framework response has successfully been designed.
• The first four achievements can be considered the theoretical foundation of the
adaptive self-tuning control framework design. The last achievement is the
development of a patch software module, which adds several functional modules and
time series hit/miss counters to the M-Sim system simulator.
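The estimation step in the third achievement can be sketched with a plain Recursive Least Squares update. The thesis uses the numerically more robust QR-factorized variant, and the regressor layout below (own and co-runner miss counts) is an assumption for illustration:

```python
import numpy as np

# Simplified Recursive Least Squares (RLS) estimator for the regression cache
# miss model. Each regressor stacks recent miss counts of the thread and its
# co-runner (an assumed layout); theta holds the polynomial coefficients of
# the cache pattern. The thesis applies the QR-factorized variant; this plain
# form is only a sketch.
def rls_update(theta, P, phi, y, lam=0.99):
    """One RLS step with forgetting factor lam; returns updated (theta, P)."""
    phi = phi.reshape(-1, 1)
    k = P @ phi / (lam + (phi.T @ P @ phi).item())       # gain vector
    theta = theta + k.ravel() * (y - (phi.T @ theta).item())
    P = (P - k @ phi.T @ P) / lam                        # covariance update
    return theta, P

rng = np.random.default_rng(0)
true_theta = np.array([0.6, 0.3])          # hypothetical "true" coefficients
theta, P = np.zeros(2), np.eye(2) * 100.0
for _ in range(200):
    phi = rng.uniform(0.0, 10.0, size=2)   # [own misses, co-runner misses]
    theta, P = rls_update(theta, P, phi, phi @ true_theta)
assert np.allclose(theta, true_theta, atol=1e-3)
```

With noiseless samples the estimate converges to the underlying coefficients within a few hundred updates; the QR-factorized form computes the same estimate with better numerical conditioning.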
Following the theoretical foundation of the cache aware adaptive closed loop scheduling
framework and the development of the statistics retrieval tools, a number of experiments have
been conducted and the results analyzed using SPEC CPU2000 benchmark threads. Based on
these observations, it is concluded that the framework is particularly effective for co-runner
threads with high cache requirements, and relatively ineffective for co-runners with low
cache demands.
7.2 Future Work and Recommendations
Some potential research areas can be identified on the basis of the literature and our research outcomes:
7.2.1 Heterogeneous Multiprocessor Architecture Resource Allocation Problem
In contrast to a homogeneous multiprocessor architecture, a heterogeneous multiprocessor architec-
ture has specific run-time fault handling, power management and computational resources
for each core. For instance, while one core may be a low-power microprocessor with
limited cache and processor resources, another can be a powerful graphics processing
core with extensive memory and processor resources. In such an architecture, two important
research questions arise:
• How can threads be allocated among these heterogeneous cores such that maximum
resource efficiency can be achieved?
• What kinds of strategies can be developed to estimate thread resource requirements at run
time?
Because the specialization and corresponding resources of each core are significantly
diversified, it is critical to allocate each thread to a suitable core; otherwise, significant
degradation in thread and core performance is inevitable. There are two main approaches
to the core/resource allocation problem: static and dynamic. In either case, an efficient
cache and resource management framework is essential. Resource and role management on
dynamic heterogeneous multicore processors (DHMPs) is thus a potential field for future
research.
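As a toy illustration of the static side of this allocation problem, a greedy heuristic can match estimated thread demands to heterogeneous core capacities; all demand and capacity figures below are hypothetical:

```python
# Toy greedy static allocation for heterogeneous cores. Cores expose different
# capacities (e.g. a low-power core vs. a GPU-like core); each thread carries a
# hypothetical estimated resource demand. Largest demands are placed first, on
# the least-loaded core that can still hold them.
def allocate(threads, cores):
    """threads: {name: demand}; cores: {name: capacity} -> {thread: core}."""
    load = {c: 0.0 for c in cores}
    placement = {}
    for name, demand in sorted(threads.items(), key=lambda kv: -kv[1]):
        feasible = [c for c in cores if load[c] + demand <= cores[c]]
        if not feasible:
            raise ValueError(f"no core can host {name}")
        best = min(feasible, key=lambda c: load[c])   # least-loaded feasible core
        load[best] += demand
        placement[name] = best
    return placement

cores = {"little": 2.0, "big": 8.0}             # hypothetical capacities
threads = {"mcf": 5.0, "vpr": 1.5, "gcc": 2.5}  # hypothetical demand estimates
plan = allocate(threads, cores)
assert plan["mcf"] == "big"
```

The second research question, estimating the per-thread demands at run time, is precisely what such a static heuristic cannot do on its own.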
7.2.2 Statistical Pre-processing of Real-Time Statistical Information
Another challenge in any adaptive or dynamic framework is the estimation of the system state
based on past measurements or states. Although statistical and deterministic estimation
methods are straightforward to apply, a significant number of measurement samples exhibit
highly nonlinear and unpredictable patterns. This generally has a negative impact on the
efficiency of the estimation process and, indirectly, on overall system performance.
Hence, pre-processing the statistics, before forming a conclusion about the system state,
is necessary to remove incoherent measurement samples. This elimination
process supports system stability and accuracy. In this field, methods range
from simple filters to highly complicated data-mining algorithms.
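A minimal instance of such pre-processing is a median-based filter that discards samples deviating strongly from the recent trend; the window size and threshold below are arbitrary illustrative choices:

```python
import statistics

# Sketch of statistical pre-processing: drop measurement samples that deviate
# from the median of a sliding window by more than `thresh` median absolute
# deviations (MAD). Window size and threshold are illustrative choices.
def filter_outliers(samples, window=5, thresh=4.0):
    kept = []
    for i, x in enumerate(samples):
        recent = samples[max(0, i - window):i] or [x]
        med = statistics.median(recent)
        mad = statistics.median(abs(r - med) for r in recent) or 1.0
        if abs(x - med) <= thresh * mad:
            kept.append(x)
    return kept

# A spike of 900 in an otherwise stable miss-count stream is removed.
stream = [10, 12, 11, 13, 900, 12, 11]
assert 900 not in filter_outliers(stream)
```

Even this crude filter keeps the stable samples intact while discarding the incoherent spike; data-mining approaches refine the same idea.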
7.2.3 Robust Adaptive Control Theory
Another potential research field relevant to our project is robust control
theory, which can be applied to increase the robustness of the system by defining error margins
(bounds). This analysis can be used to reduce system control and estimation cost, especially
for adaptive self-learning systems: within a given error range, the system is considered
to have a fixed response even if there are significant oscillations within that range. As a
result, the adaptive self-learning cost can be reduced significantly. However, the derivation of these
bounds requires substantial mathematical and system analysis. With these mathematical tools,
the computational cost of our cache aware adaptive closed loop scheduling framework could be
decreased.
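The error-bound idea can be illustrated with a deadband: inside the bound the controller holds its previous output and skips the adaptation step, which is where the self-learning cost saving comes from. The bound and gain values below are hypothetical:

```python
# Sketch of a robustness bound as a deadband. When the tracking error stays
# within +/- BOUND, the controller reuses its last output and skips the costly
# adaptation (estimation) step entirely. BOUND and GAIN are hypothetical.
BOUND = 5.0
GAIN = 0.5

def step(error, last_output):
    """Return (output, adapted) for one control interval."""
    if abs(error) <= BOUND:
        return last_output, False        # fixed response, no adaptation cost
    return GAIN * error, True

outputs, adaptations = [], 0
out = 0.0
for err in [1.0, -3.0, 40.0, 4.0, -2.0]:
    out, adapted = step(err, out)
    outputs.append(out)
    adaptations += adapted
assert adaptations == 1                  # only the 40.0 excursion adapts
```

Deriving a bound that guarantees stability inside the deadband is the mathematically demanding part noted above.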
7.2.4 Theoretical Analysis of the Scheduling Framework
As stated in Chapter 4 and Chapter 5, the algorithmic complexities of the QR-RLS
and pole placement controller design algorithms are Θ(n²) and Θ(n³), respectively. Based on
this fact, the overall framework's computational complexity is max(Θ(n³), Θ(n²))
= Θ(n³), where n refers to the system order, i.e. the number of past measurement samples used
in deriving the current system states. In this thesis, the theoretical analysis of the
overall framework and of the individual algorithms is not comprehensive, owing to the limited scope
and time frame of our Masters research project. A potential future work is a comprehensive
theoretical analysis of the proposed algorithms, algebraic pole placement and QR Recursive Least
Squares (QR-RLS), covering the complexity and stability of the algorithms as well
as of the overall scheduling framework.
QR-RLS Algorithm The Fast QR-RLS algorithm is a promising alternative, which would reduce
the complexity of the online learning (parameter identification) module from Θ(n²)
to Θ(n).
Pole Placement Algorithm The pole placement algorithm has a high computational complex-
ity compared to the QR-RLS algorithm; hence, it can be considered the computational
bottleneck of the overall framework. Further theoretical analysis
and alternative computational approaches are necessary to reduce its complexity.
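As a concrete instance of the pole placement step for the simplest possible case, consider an identified first-order model x(k+1) = a·x(k) + b·u(k) with state feedback u(k) = -K·x(k); the closed loop pole lies at a - bK, so K = (a - p)/b places it at a desired p. The plant values below are hypothetical; the general multi-parameter case requires solving a linear (Diophantine) system, which is where the Θ(n³) cost arises:

```python
# First-order pole placement sketch. For x(k+1) = a*x(k) + b*u(k) and
# feedback u(k) = -K*x(k), the closed loop is x(k+1) = (a - b*K)*x(k),
# so choosing K = (a - p)/b places the pole at p. Values are hypothetical.
a, b = 1.2, 0.5          # identified (unstable) plant parameters
p = 0.6                  # desired stable closed-loop pole (|p| < 1)

K = (a - p) / b          # pole placement gain
assert abs((a - b * K) - p) < 1e-12

# The closed loop now decays geometrically instead of diverging.
x = 1.0
for _ in range(20):
    x = (a - b * K) * x
assert abs(x) < 1e-3
```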
Literature Cited
Anderson, J. H. and Srinivasan, A. (1999). A new look at pfair priorities. Technical report, In
Submission.
Anderson, J. H. and Srinivasan, A. (2000). Early release fair scheduling. 12th Euromicro
Conference on Real-Time Systems, 0:35.
Anthes, G. H. (2000). Cache memory. Computerworld, 34(14):62.
Apolinario, J. A. J. and Miranda, M. D. (2009). QRD-RLS Adaptive Filtering. Springer
Science+Business Media, LLC.
Astrom, K. J. and Wittenmark, B. (1994). Adaptive Control. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.
Baruah, S., Gehrke, J., and Plaxton, C. (1995). Fast scheduling of periodic tasks on multiple
resources. In Proceedings of the 9th International Parallel Processing Symposium, pages
280–288.
Baruah, S. K., Cohen, N. K., Plaxton, C. G., and Varvel, D. A. (1996). Proportionate progress:
A notion of fairness in resource allocation. Algorithmica, 15:600–625.
Bertogna, M., Cirinei, M., and Lipari, G. (2009). Schedulability analysis of global scheduling
algorithms on multiprocessor platforms. IEEE Trans. Parallel Distrib. Syst., 20(4):553–566.
Bobal, V., Bohm, J., Fessl, J., and Machacek, J. (2005). Digital Self-tuning Controllers.
Springer-Verlag.
Bohlin, T. (2006). Practical Grey-box Process Identification: Theory and Applications
(Advances in Industrial Control). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Brinkschulte, U. and Pacher, M. (2008). A control theory approach to improve the real-time
capability of multi-threaded microprocessors. In ISORC ’08: Proceedings of the 2008 11th
IEEE Symposium on Object Oriented Real-Time Distributed Computing, pages 399–404,
Washington, DC, USA. IEEE Computer Society.
Camacho, E. F. and Bordons, C. (1999). Model Predictive Control. Springer-Verlag.
Microsoft Corporation (2003). MSDN Library. ASP.NET.
Cottet, F., Delacroix, J., Kaiser, C., and Mammeri, Z. (2002). Scheduling in Real-Time Systems.
John Wiley & Sons, Chichester.
Diniz, P. S. (2008). Adaptive Filtering Algorithms and Practical Implementation. Springer
Science+Business Media, LLC.
Ebrahimi, E., Lee, C. J., Mutlu, O., and Patt, Y. N. (2010). Fairness via source throttling:
a configurable and high-performance fairness substrate for multi-core memory systems. In
ASPLOS ’10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for
programming languages and operating systems, pages 335–346, New York, NY, USA. ACM.
Elliott, D. L. (2009). Bilinear Control Systems: Matrices in Action, volume 169 of Applied
Mathematical Sciences. Springer Science+Business Media.
Fedorova, A., Seltzer, M., and Smith, M. D. (2006). Cache-fair thread scheduling for multicore
processors. Technical report, Harvard University.
Green, M. and Limebeer, D. J. N. (1995). Linear robust control. Prentice-Hall, Inc., Upper
Saddle River, NJ, USA.
Intel (2006). Dual-Core Update to the Intel Itanium 2 Processor Reference Manual. Intel
Corporation.
Ioannou, P. and Fidan, B. (2006). Adaptive Control Tutorial, volume 11 of Advances in Design
and Control. Society for Industrial and Applied Mathematics.
Itzkovitz, A., Schuster, A., and Shalev, L. (1998). Thread migration and its applications in
distributed shared memory systems. Journal of Systems and Software, 42(1):71 – 87.
Jahre, M. and Natvig, L. (2009). A light-weight fairness mechanism for chip multiprocessor
memory systems. In CF ’09: Proceedings of the 6th ACM conference on Computing frontiers,
pages 1–10, New York, NY, USA. ACM.
Jain, R., Chiu, D.-M., and Hawe, W. (1998). A quantitative measure of fairness and
discrimination for resource allocation in shared computer systems. CoRR, cs.NI/9809099:1–
32.
Kent, A. and Williams, J. G. (1997). Encyclopedia of Computer Science and Technology:
Supplement 21, volume 36 of Encyclopedia of Computer Science and Technology. CRC
Press.
Kim, S., Chandra, D., and Solihin, Y. (2004). Fair cache sharing and partitioning in a chip
multiprocessor architecture. In PACT ’04: Proceedings of the 13th International Conference
on Parallel Architectures and Compilation Techniques, pages 111–122, Washington, DC,
USA. IEEE Computer Society.
KleinOsowski, A. J. and Lilja, D. J. (2002). Minnespec: A new spec benchmark workload for
simulation-based computer architecture research. IEEE Comput. Archit. Lett., 1:7–.
Kroft, D. (1981). Lockup-free instruction fetch/prefetch cache organization. In ISCA ’81:
Proceedings of the 8th annual symposium on Computer Architecture, pages 81–87, Los
Alamitos, CA, USA. IEEE Computer Society Press.
Ljung, L. (1998). System Identification: Theory for the User (2nd Edition). Prentice Hall PTR.
Mak, P., Blake, M. A., Jones, C. C., Strait, G. E., and Turgeon, P. R. (1997). Shared-cache
clusters in a system with a fully shared memory. IBM J. Res. Dev., 41(4-5):429–448.
Matick, R. E., Heller, T. J., and Ignatowski, M. (2001). Analytical analysis of finite cache
penalty and cycles per instruction of a multiprocessor memory hierarchy using miss rates and
queuing theory. IBM J. Res. Dev., 45(6):819–842.
Sun Microsystems (2006). OpenSPARC T1 Microarchitecture Specification. Sun Microsystems,
Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A., first edition.
Ogata, K. (1997). Modern control engineering (3rd ed.). Prentice-Hall, Inc., Upper Saddle
River, NJ, USA.
Olukotun, K. (2007). Chip Multiprocessor Architecture: Techniques to Improve Throughput
and Latency. Morgan and Claypool Publishers.
Paraskevopoulos, P. (2002). Modern Control Engineering. Control Series. Marcel Dekker, Inc.,
270 Madison Avenue, New York, NY 10016.
Pfenning, F. and Barbic, J. (2007). Multi-core architecture.
Siddha, S., Pallipadi, V., and Mallick, A. (2007). Process scheduling challenges in the era of
multi-core processors. Intel Technology Journal, 11(4):361–370.
Silicon Graphics (2000). Origin2000 and Onyx2 Performance Tuning and Optimization
Guide. HTML.
Srikantaiah, S., Kandemir, M., and Wang, Q. (2009). Sharp control: controlled shared cache
management in chip multiprocessors. In MICRO 42: Proceedings of the 42nd Annual
IEEE/ACM International Symposium on Microarchitecture, pages 517–528, New York, NY,
USA. ACM.
Stacpoole, R. and Jamil, T. (2000). Cache memories. IEEE Potentials, 19(2):24–29.
Stallings, W. (2003). Computer Organization and Architecture: Designing for Performance
(7th Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Suh, G. E., Devadas, S., and Rudolph, L. (2001). Analytical cache models with applications
to cache partitioning. In ICS ’01: Proceedings of the 15th international conference on
Supercomputing, pages 1–12, New York, NY, USA. ACM.
Tam, D. K., Azimi, R., Soares, L. B., and Stumm, M. (2009). Rapidmrc: approximating l2 miss
rate curves on commodity systems for online optimizations. In ASPLOS ’09: Proceeding of
the 14th international conference on Architectural support for programming languages and
operating systems, pages 121–132, New York, NY, USA. ACM.
Tanenbaum, A. S. (2005). Structured Computer Organization (5th Edition). Prentice Hall, 5
edition.
Thimmannagari, C. (2008). Cpu Design: Answers to Frequently Asked Questions. Springer
Publishing Company, Incorporated.
Velusamy, S., Sankaranarayanan, K., Parikh, D., Abdelzaher, T., and Skadron, K. (2002).
Adaptive cache decay using formal feedback control. In In Proceedings of the 2002 Workshop
on Memory Performance Issues.
Vetter, S., Filhol, B., Kim, S., Linzmeier, G., and Plachy, O. (2006). IBM System p5 quad-core
module based on POWER5+ technology. Technical overview and introduction, IBM Corporation,
International Technical Support Organization, Dept. JN9B Building 905, 11501
Burnet Road, Austin, Texas 78758-3493, U.S.A.
Villani, P. (2001). Programming Win32 under the API. CMP Books, CMP Media, Inc.,
Publishers Group West, 1700 Fourth Street, Berkley, CA 94710.
Zhou, B., Qiao, J., and Lin, S. (2009a). Research on synthesis parameter real-time scheduling
algorithm on multi-core architecture. In CCDC’09: Proceedings of the 21st annual
international conference on Chinese control and decision conference, pages 5152–5156,
Piscataway, NJ, USA. IEEE Press.
Zhou, X., Chen, W., and Zheng, W. (2009b). Cache sharing management for performance
fairness in chip multiprocessors. In PACT ’09: Proceedings of the 2009 18th
International Conference on Parallel Architectures and Compilation Techniques, pages 384–
393, Washington, DC, USA. IEEE Computer Society.