Adaptive Cache Aware Multiprocessor Scheduling
Framework (Research Masters)
A THESIS SUBMITTED TO
THE FACULTY OF SCIENCE AND TECHNOLOGY
OF QUEENSLAND UNIVERSITY OF TECHNOLOGY
IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
RESEARCH MASTER
Huseyin Gokseli Arslan
Faculty of Science and Technology
Queensland University of Technology
September 2011
Copyright in Relation to This Thesis
© Copyright 2011 by Huseyin Gokseli Arslan. All rights reserved.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an
award at this or any other higher education institution. To the best of my knowledge and belief,
the thesis contains no material previously published or written by another person except where
due reference is made.
Signature:
Date:
This thesis is dedicated to my dearest family and my beloved one
for their love and endless support.
Abstract
Computer resource allocation represents a significant challenge, particularly for multiprocessor systems, which consist of shared computing resources to be allocated among co-runner processes and threads. While efficient resource allocation results in a highly efficient and stable overall multiprocessor system and good individual thread performance, poor resource allocation causes significant performance bottlenecks even in systems with abundant computing resources. This thesis proposes a cache aware adaptive closed loop scheduling framework as an efficient resource allocation strategy for the highly dynamic resource management problem, which requires instant estimation of highly uncertain and unpredictable resource patterns.
Many different approaches to this highly dynamic resource allocation problem have been developed, but neither the dynamic nature nor the time-varying and uncertain characteristics of the resource allocation problem are well considered. These approaches employ either static or dynamic optimization methods, or advanced scheduling algorithms such as the Proportional Fair (PFair) scheduling algorithm. Some of these approaches, which consider the dynamic nature of multiprocessor systems, apply only a basic closed loop system; hence, they fail to take the time-varying and uncertain nature of the system into account. Therefore, further research into multiprocessor resource allocation is required.
Our closed loop cache aware adaptive scheduling framework takes resource availability and resource usage patterns into account by measuring time-varying factors such as cache miss counts, stalls and instruction counts. More specifically, the cache usage pattern of a thread is identified using the QR recursive least squares (RLS) algorithm and cache miss count time series statistics. For the identified cache resource dynamics, our closed loop cache aware adaptive scheduling framework enforces instruction fairness for the threads. Fairness in the context of our research project is defined as resource allocation equity, which reduces co-runner thread dependence in a shared resource environment. In this way, instruction count degradation due to shared cache resource conflicts is overcome.
In this respect, our closed loop cache aware adaptive scheduling framework contributes to the research field in two major and three minor aspects. The two major contributions lead to the cache aware scheduling system. The first major contribution is the development of the execution fairness algorithm, which reduces the impact of co-runner cache contention on thread performance. The second contribution is the development of the relevant mathematical models, such as the thread execution pattern and cache access pattern models, which formulate the execution fairness algorithm in terms of mathematical quantities.
Following the development of the cache aware scheduling system, our adaptive self-tuning control framework is constructed to add an adaptive closed loop aspect to the cache aware scheduling system. This control framework consists of two main components: the parameter estimator and the controller design module. The first minor contribution is the development of the parameter estimator; the QR Recursive Least Squares (RLS) algorithm is applied in our closed loop cache aware adaptive scheduling framework to estimate the highly uncertain and time-varying cache resource patterns of threads. The second minor contribution is the design of the controller design module; an algebraic controller design algorithm, pole placement, is utilized to design the relevant controller, which is able to provide the desired time-varying control action. The adaptive self-tuning control framework and the cache aware scheduling system together constitute our final framework, the closed loop cache aware adaptive scheduling framework. The third minor contribution is the validation of this framework's efficiency in overcoming co-runner cache dependency. Time-series statistical counters are developed for the M-Sim Multi-Core Simulator, and the theoretical findings and mathematical formulations are implemented as MATLAB m-files. In this way, the overall framework is tested and the experimental outcomes are analyzed. From these outcomes, it is concluded that our closed loop cache aware adaptive scheduling framework successfully drives the co-runner cache dependent thread instruction count towards the co-runner independent instruction count, with an error margin of up to 25% when the cache is highly utilized. In addition, the thread cache access pattern is estimated with 75% accuracy.
Keywords
Multiprocessor Scheduling, Adaptive Control Theory, Recursive Least Square, Cache-Aware
Adaptive Scheduling Framework
Acknowledgments
I gratefully acknowledge the contributions of my principal supervisor, Assoc. Prof. Glen Tian, my associate supervisor, Dr. Ross Hayward, and Queensland University of Technology.
Table of Contents
Abstract v
Keywords vii
Acknowledgments ix
Nomenclature xv
List of Figures xxiv
List of Tables xxv
1 Introduction 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Scope and Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Literature Review 13
2.1 Multi-Core Chip Level Multiprocessor System Architecture . . . . . . . . . . . 13
2.1.1 Core Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Memory Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 Core Diversity and Parallelism . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Cache Architecture and Policies . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Cache Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Cache Performance Indicator: Cache Miss . . . . . . . . . . . . . . . 21
2.3 Multiprocessor Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Multiprocessor Scheduling Taxonomy . . . . . . . . . . . . . . . . . . 23
2.3.2 Real-Time Multi-Core Multiprocessor Scheduling Algorithms . . . . . 25
2.4 Cache-Aware Multi-Core Chip Level Multiprocessor Scheduling . . . . . . . . 27
2.4.1 Cache-Fair Multi-Core CMP Scheduling . . . . . . . . . . . . . . . . 27
2.4.2 Adaptive Cache-Aware CMP Scheduling . . . . . . . . . . . . . . . . 40
2.5 Modern Control Theory for Scheduling Problems . . . . . . . . . . . . . . . . 47
2.5.1 Adaptive Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.2 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5.3 Control System Design Algorithms . . . . . . . . . . . . . . . . . . . 53
3 System Model Adaptive Control 57
3.1 Theoretical Background of Dynamic System Model . . . . . . . . . . . . . . . 57
3.1.1 State-Space Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.1.2 Adaptive Control Theory . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Development of Thread Execution Pattern Model . . . . . . . . . . . . . . . . 64
3.3 Control Framework Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4 Adaptive Control Framework Development . . . . . . . . . . . . . . . . . . . 71
3.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Parameter Estimation in Adaptive Control 75
4.1 Theoretical Background of Parameter Estimation . . . . . . . . . . . . . . . . 75
4.2 Least Square Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Adaptive Weighted QR Recursive Least Square Algorithms . . . . . . . . . . . 79
4.3.1 Theoretical Background . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 Formulation and Theoretical Conclusions . . . . . . . . . . . . . . . . 81
4.3.3 Complexity Analysis of QR-RLS Algorithm . . . . . . . . . . . . . . 92
4.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5 Algebraic Controller Design Methods 93
5.1 Introduction and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Theoretical Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Deadbeat Controller Design . . . . . . . . . . . . . . . . . . . . . . . 96
5.2.2 Pole Placement Controller Design . . . . . . . . . . . . . . . . . . . . 102
5.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Experimental Setup and Simulation 113
6.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.2 Experiment Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2.1 Development of Experiment Constraints . . . . . . . . . . . . . . . . . 115
6.2.2 Design Strategy: Two-Stage Experiment . . . . . . . . . . . . . . . 117
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.3.2 Experiment Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3.3 Analysis and Evaluation of Experiments . . . . . . . . . . . . . . . . . 122
6.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7 Conclusions and Recommendations 137
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 Future Work and Recommendations . . . . . . . . . . . . . . . . . . . . . . . 138
7.2.1 Heterogeneous Multiprocessor Architecture Resource Allocation Problem 138
7.2.2 Statistical Pre-processing of Real-Time Statistical Information . . . . . 139
7.2.3 Robust Adaptive Control Theory . . . . . . . . . . . . . . . . . . . . . 139
7.2.4 Theoretical Analysis of the Scheduling Framework . . . . . . . . . . . 140
Bibliography 141
Nomenclature
Abbreviations
ALU Arithmetic Logic Unit
AppC Application Controller
APPC Adaptive Pole Placement
ATD Auxiliary Tag Directory
CMP Chip Level Multiprocessor
CMR Co-runner Miss Rate
CP Cache Miss Penalty
CPI Cycles Per Instruction
BQ Bus Queue
BQD Buffer Queue Delay
DARMA Dynamic Auto Regressive Moving Average
DBQ Data Bus Queue
DM Deadline-Monotonic (Scheduling)
DMHA Dynamic Miss Handling Architecture
EDF Earliest-Deadline-First (Scheduling)
ER-PF Early Release Proportional Fair (Scheduling)
ER-PD Early Release Predictive Deadline (Scheduling)
FAMHA Fair Adaptive Miss Handling Architecture
FCP Finite Cache Penalty
FIFO First In, First Out
FP Fixed Priority
FS Fair Speedup
FSI Fair Speedup Improvement
IC Instruction Count
ILP Instruction Level Parallelism
IPC Inter Processor Communication
IPs Interference Points
LFU Least Frequently Used (Scheduling)
LLF Least Laxity First (Scheduling)
LRU Least Recently Used (Scheduling)
LS Least Square
LTI Linear Time Invariant
LTV Linear Time Varying
MAD Memory Access Delay
MHA Miss Handling Architecture
MIMO Multiple Input Multiple Output
MLP Memory Level Parallelism
MSHR Miss Status Holding Register
MR Miss Rate
MRC Miss Rate Curves
MRU Most Recently Used
NUMA Non-Uniform Memory Access
PAN Pre-Actuation Negotiator
PD Pseudo Deadline (Scheduling)
PF Proportional Fair (Scheduling)
Pfair Proportional Fair (Scheduling)
PI Proportional-Integral (Control/Controller)
PID Proportional-Integral-Derivative (Control/Controller)
RBQ Request Bus Queue
RLS Recursive Least Square
RM Rate-Monotonic (Scheduling)
SD Service Differentiation
SHARP Shared Resource Partitioning
SIMD Single Instruction Multiple Data
SISO Single Input Single Output
SMP Symmetric Multiprocessor
SMT Simultaneous Multithreading
SPM Static Parametric Model
UMA Uniform Memory Access
VLIW Very Long Instruction Word
VT Vertical Threading
Symbols Chapter
A,B,C System State Space Model Coefficient Matrices Ch3
A System Plant Denominator Polynomial Ch5
ai ith coefficient term of A Ch5
Ac Desired Closed Loop Characteristic Polynomial Ch5
Ay Maximum Overshoot (Closed Loop Characteristic) Ch5
B System Plant Numerator Polynomial Ch5
bi ith coefficient term of B Ch5
C Controller Ch3
CMC Co-Runner Miss Count Vector Ch3,5
cmc Co-Runner Miss Count Ch3,5
DGDes Desired Closed Loop Denominator Polynomial Ch5
DIC ICCacheDedicated Denominator Polynomial Ch5
E Error Tracking Function Ch5
fe(x, θ) Prediction Error Distribution Ch4
ICCacheDedicated Cache Dedicated Closed Loop System Ch5
ICError Instruction Count Error Ch3
G System Plant Transfer Function Ch3
GDesired Desired Closed Loop Transfer Function Ch5
G.M Gain Margin Ch5
GP Guaranteed Percentage Ch2
linf Infinity Norm Ch3
l2 Euclidean Norm Ch3
L(q) Linear Filter Ch4
M∗ System Model Set Ch4
M(θ) System Model Set Element Ch4
M Fairness Metric Ch2
m Normalization Signal Ch3
MC Miss Count Vector Ch3
mc Miss Count Ch3
mc(k) Miss Count Estimate Ch4
mc(k) Miss Count Estimate Vector Ch4
mcq(k) Triangularized Miss Count Vector Ch4
MPR Miss Prediction Rate Ch2
Mperf Performance Fairness Metric Ch2
MMiss Cache Fairness Metric Ch2
NGDes Desired Closed Loop Numerator Polynomial Ch5
NIC ICCacheDedicated Numerator Polynomial Ch5
Q(k) Givens Rotation Matrix Ch4
Qθi(k) Givens Rotation Matrices with Rotation Angle θi Ch4
Q Controller Transfer Function Numerator Polynomial Ch5
qi ith coefficient term of Q Ch5
P Controller Transfer Function Denominator Polynomial Ch5
pi ith coefficient term of P Ch5
P outi Performance Metric Ch2
P refi Performance Reference Metric Ch2
R Autocorrelation Matrix Ch4
si Continuous Time Roots of Polynomial Ch5
Tded Execution Time Dedicated Cache Environment Ch2
Tovl Overlap Operation Cycles Ch2
To Sampling Period Ch5
Tpri Private Operation Cycles Ch2
Tshr Execution Time Shared Cache Environment Ch2
Tvul Vulnerable Operation Cycles Ch2
tr trace Ch4
u Control Input Ch3, Ch5
uc Input Command (Closed Loop) Ch5
U(k) Triangularized Input Data Matrix Ch4
V [k] Random Noise Component Ch3
VN(θ, ZN) Norm or Criterion Cost Function Ch4
wi Requested Cache Ways Ch2
W Available Cache Ways Ch2
W Parameter Weight Vector Ch4
W [k] Random Noise Component Ch3
wn Normalized Frequency Ch5
w(k) Plant Parameter Vector (QR RLS Algorithm) Ch4
w Plant Parameter Coefficients Ch4
x State Space Variable Ch2,3
x State Space Vector Ch2,3
x State Space Variable Estimate Ch2,3
y System Output Ch3,4,5,6
y(t|θ) Prediction Ch4
zi Discrete Time Polynomial Roots Ch5
δCPUCY CLE Additional CPU Cycles Ch3
‖·‖∞ Infinity Norm Ch3
Greek Letters
ϕi(∞) Ideal Instruction Per Cycle (wi=∞) Ch2
ψi(t) Predicted Number of Cache Ways Ch2
Θ Weighting Factor Ch2
θ(t) Plant Parameter Vector (Adaptive Control) Ch3
θ(t)∗ Unknown Plant Parameter Vector (Adaptive Control) Ch3
θi Givens Rotation Angles Ch4
ϕ Regression Input (Regressor) Ch3,4
ϕ(k) Input Regression Vector Ch4
ψ Input Data Matrix Ch4
φ Regression Vector Ch3
Γ Adaptive Gain Ch3
ε(t, θ∗) Model Prediction Error Ch4
ε error vector (Least Square) Ch4
ε Posterior Error Vector Ch4
ξd(k) RLS Cost Function Ch4
ζ Damping Ratio Ch5
List of Figures
2.1 Instruction Issue and Cache Miss for a Single Threaded Processor and a 2-Threaded Processor Supporting SMT [Thimmannagari, 2008] . . . . . . . . . . 15
2.2 4-Way Set Associative Cache and Main Memory . . . . . . . . . . . . . . . . 20
2.3 Thread CPU Latency vs Cache Allocation for Different Scenarios [Fedorova
et al., 2006] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 SHARP Control Architecture [Srikantaiah et al., 2009] . . . . . . . . . . . . . 43
2.5 Block Diagram Adaptive System [Astrom and Wittenmark, 1994] . . . . . . . 48
3.1 Indirect (Explicit) Adaptive Control [Ioannou and Fidan, 2006] . . . . . . . . 63
3.2 Direct (Implicit) Adaptive Control [Ioannou and Fidan, 2006] . . . . . . . . . 64
3.3 Thread Execution Pattern Model . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4 Closed-Loop CMP Control System Model . . . . . . . . . . . . . . . . . . . . 70
5.1 A General Linear Controller with Two Degrees of Freedom [Astrom and Wit-
tenmark, 1994] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Step Response of Closed Loop Transfer Function . . . . . . . . . . . . . . . . 106
5.3 Closed Loop Stability Analysis Bode Plots & Root Locus . . . . . . . . . . . . 107
6.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 MATLAB Framework Simulation Flow . . . . . . . . . . . . . . . . . . . . . 120
6.3 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 123
6.4 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 123
6.5 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 124
6.6 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 125
6.7 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 126
6.8 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 126
6.9 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 127
6.10 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 128
6.11 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 129
6.12 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 129
6.13 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 130
6.14 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 131
6.15 Reference Instruction Count and Actual Instruction Count . . . . . . . . . . . 132
6.16 Adaptive Instruction Count and Actual Instruction Count . . . . . . . . . . . . 132
6.17 Adaptive Instruction Count and Reference Instruction Count . . . . . . . . . . 133
6.18 Adaptive Miss Count and Actual Miss Count . . . . . . . . . . . . . . . . . . 134
List of Tables
3.1 System Type vs State Space Model . . . . . . . . . . . . . . . . . . . . . . . . 62
6.1 Multi-Core CMP Architecture Constraints . . . . . . . . . . . . . . . . . . . . 119
6.2 Workload/Thread Definition & Scope [KleinOsowski and Lilja, 2002] . . . . . 121
6.3 Adaptive Scheduling Framework Experiments Results . . . . . . . . . . . . . 121
Chapter 1
Introduction
This thesis proposes a cache-aware adaptive closed loop scheduling framework to address the shared computing resource management problem of threads. Despite the significant technological advancement in multiprocessor architectural and computational capabilities, it is still a challenge to predict the computational resource requirements of threads. This lack of knowledge of resource demands causes inefficiency in the overall performance of the multi-core chip level multiprocessor architecture. In this context, our real-time closed loop scheduling framework is designed to increase processor resource allocation efficiency on multi-core multiprocessing platforms. In other words, accurate cache resource estimation and efficient processor cycle allocation are the main goals of our research project. To achieve these goals, fair resource allocation is enforced on each core: when co-runner threads make unequal cache allocation requests, fairness is imposed by adjusting the CPU cycles granted to each thread within a single CPU quantum. In our project context, fairness is the proportional allocation of the different computing resources belonging to the same thread. For instance, for a highly active thread with a relatively high instruction count allocation, fair allocation imposes a proportional cache memory allocation, so that a thread currently allocated many processor cycles will not be stalled for lack of the complementary resource, cache memory. As a result, each thread will execute as many instructions per cache allocation as it would under ideal circumstances, in which a dedicated cache is allocated [Fedorova et al., 2006]. In this context, the metric indicating unfair cache resource allocation is the difference between the instruction count executed in the dedicated cache environment and the instruction count achieved in the shared cache environment. In fact, this difference is equivalent to the cache stall caused by unfair cache allocation. In this case,
our framework allocates more processor cycles to that particular thread to drive it towards the instruction count of the dedicated cache case.
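The unfairness metric and the resulting cycle adjustment described above can be sketched as follows. The function names and the simple proportional adjustment rule are illustrative assumptions, not the thesis's exact formulation:

```python
def cache_stall_error(ic_dedicated, ic_shared):
    """Unfairness metric: instruction-count gap between the dedicated-cache
    and shared-cache environments (equivalent to the unfair cache stall)."""
    return ic_dedicated - ic_shared

def adjusted_cycles(base_cycles, ic_dedicated, ic_shared):
    """Grant a thread extra processor cycles in proportion to its
    instruction-count degradation (illustrative proportional rule)."""
    if ic_dedicated == 0:
        return base_cycles
    degradation = cache_stall_error(ic_dedicated, ic_shared) / ic_dedicated
    return int(base_cycles * (1 + degradation))

# A thread that executed 800 instructions under cache sharing, but would
# have executed 1000 with a dedicated cache, receives 20% more cycles.
print(adjusted_cycles(10000, 1000, 800))  # -> 12000
```

A thread that suffers no degradation keeps its base allocation, so the adjustment vanishes exactly when the fairness metric is zero.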
Our cache-aware adaptive closed loop scheduling framework addresses the thread allocation and corresponding cache sharing problems by employing an adaptive self-tuning regulator control system architecture with a parameter estimation algorithm, QR Recursive Least Squares (RLS). In other words, our cache aware adaptive closed loop scheduling framework can be deconstructed into the cache aware scheduling system, which formulates the algorithmic solution to the thread allocation and cache sharing problem, and the adaptive self-tuning control framework, which ensures the consistency of the solution for the time-varying (adaptive) thread allocation and cache sharing problem. In our adaptive self-tuning regulator architecture, the cache-aware adaptive closed loop scheduling framework updates the controller parameters in line with any changes in the operating environment, which in our case is the shared cache allocation of each co-runner thread. As a result, our research contributes an innovative closed loop scheduling framework which guarantees stable and efficient scheduling of multiple threads on a multicore chip multiprocessing platform with a shared L2 cache.
1.1 Background and Motivation
Optimal and efficient resource management of limited computing resources has been a signif-
icant research problem, which has been addressed with many different architectural and algo-
rithmic approaches. Emerging application parallelism in line with the architectural parallelism
has improved the computing resource availability. In fact, this is achieved by the architectural
innovation encouraging parallel subtasks (threads) to utilize a pool of computing resources rather than dedicated limited resources. However, this architectural revolution has also brought a new research problem: the effective management of these shared resource pools. As a result, there
has been significant research effort on computing resource allocation efficiency problems in
multiprocessor architecture supporting thread parallelism at both software and hardware level.
In our thesis, our effort is focused on the processor cycle allocation on a multi-core chip level
multiprocessor with shared L2 cache resources.
In general, existing resource management/allocation (scheduling) approaches can be clas-
sified into cache-aware scheduling algorithms and traditional scheduling algorithms. In the
traditional thread scheduling there are two main factors involved: priority and affinity. Priority is determined through inheritance; for instance, a thread or task can inherit its priority from the thread (task) that spawned it. Affinity assigns threads/tasks to a subset of processors [Villani, 2001] based on historical allocation, considering the fact that remnants of a thread/task may remain in the cache of a core from the last thread/task execution. Hence, assigning these tasks/threads to the same core results in more efficient processor utilization than allocating them to a different processor [Corporation, 2003]. In multiprocessing environments, among the most widely used algorithms are PFair-class algorithms, which address efficiency loopholes of uniprocessor scheduling algorithms such as Earliest Deadline First (EDF) and Rate Monotonic (RM).
Cache-aware scheduling algorithms address the inability of processors to analyze the memory access behaviors of cache units. Cache-aware scheduling can be classified into two groups: cache-fair scheduling and cache pattern-aware scheduling (placement). Cache-fair thread scheduling algorithms take advantage of the fact that fair sharing of resources among threads will minimize resource conflicts and bottlenecks. Cache pattern-aware scheduling (placement) algorithms are based on partitioning scheduling algorithms, in which tasks are statically scheduled among the available cores based on the scheduling criteria [Zhou et al., 2009a]. That is, they analyze the data patterns of the threads at the application level and place the threads onto the most convenient cores according to their cache data access behaviors.
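As a minimal illustration of the placement idea, the sketch below statically assigns threads to cores so that estimated per-core cache demand is balanced. The greedy heuristic and the demand figures are assumptions for illustration, not a specific algorithm from the literature:

```python
def place_threads(cache_demand, num_cores):
    """Greedy static placement: assign each thread (heaviest first) to the
    core with the least accumulated cache demand."""
    load = [0.0] * num_cores
    placement = {}
    for tid in sorted(cache_demand, key=cache_demand.get, reverse=True):
        core = min(range(num_cores), key=lambda c: load[c])
        placement[tid] = core
        load[core] += cache_demand[tid]
    return placement

# Four hypothetical threads with differing cache footprints on two cores.
demands = {"t0": 8.0, "t1": 6.0, "t2": 3.0, "t3": 2.0}
print(place_threads(demands, 2))  # -> {'t0': 0, 't1': 1, 't2': 1, 't3': 0}
```

Real cache pattern-aware schedulers use richer criteria (data reuse, sharing between threads) than a single scalar demand, but the partitioning structure is the same: a static decision made before execution.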
Simulation results, as well as the scope of the existing cache-aware multi-core scheduling algorithms, indicate that these algorithms are generally static open-loop algorithms which cannot dynamically adapt to processor state transitions such as the varying cache demands of threads [Fedorova et al., 2006; Zhou et al., 2009b; Ebrahimi et al., 2010; Kim et al., 2004; Jahre and Natvig, 2009]. Hence, the efficiency of such an algorithm is only verifiable at a specific system state, and the algorithm's performance degrades to the extent that the system state or operating point moves away from the expected value. As a result, closed loop algorithms are needed to replace traditional open loop algorithms in operating environments where system states and parameters are subject to significant variation. In the context of multi-core chip level multiprocessor scheduling, cache access patterns can also be identified as highly unpredictable and uncertain due to the highly correlated nature of co-runner threads. The simulation results of Tam et al. [Tam et al., 2009] on a number of applications indicate that the cache miss rate curves may show great differences between applications
or threads; thus, it is very hard to predict the thread data reference pattern, as well as the cache miss rate, in such highly dynamic environments. Hence, a dynamic closed loop framework is required to adapt to varying L2 cache access patterns.
There have been previous attempts at using a closed feedback framework for cache allocation on multi-core chip architecture platforms, but to the best of our knowledge, there is no research work implementing adaptive self-tuning control theory for such a problem. Srikantaiah et al. [Srikantaiah et al., 2009] use formal feedback control theory for dynamically partitioning the multiprocessor shared last-level caches; this is achieved by optimizing last-level cache utilization among multiple concurrently executing applications on a well defined service level platform. Srikantaiah et al. [Srikantaiah et al., 2009] also explain the advantage of using formal feedback as a theoretical guarantee of maximizing the utilization of the cache space in a fair manner.
The actual challenge in such environments is the fact that a CPU clock cycle (quantum) is a
very small period. In addition, superscalar core architecture, multithreading at processor level
and cache replacement algorithms continuously change cache and processor dynamics on each
clock period/cycle. In this respect, the multi-core multiprocessor efficiency and utilization can
be subject to a significant worst-case system destabilization. As a result, advanced closed loop
tools for a multi-core scheduling algorithm are proposed to track these multi-core chip level
multiprocessor cache behaviors.
In this regard, cache behavior can be classified into steady state and transient cache behaviors. Steady state cache behavior refers to the cache behavior after the initial cache memory blocks are allocated to the process/thread and the requested initial data is successfully placed in the corresponding blocks of the on-chip cache units. In connection with steady-state cache behavior, a significant observation from the miss rate curves [Tam et al., 2009] is that, after the initial cache allocation for an application, the miss rate curve (MRC) of the application approaches a specific cache miss value. In contrast, transient cache behavior refers to the cache behavior during the unsettled period from the initial memory access request to the placement of the initial data in the cache memory. Transient cache behavior, which indicates how fast miss rate curves approach their limits, shows a strong dependence on the application's data coherence. For instance, if an application utilizes streaming data, the cache content needs to be updated continuously;
hence, the miss rate curve (MRC) for this type of application will have a versatile and non-uniform characteristic. In contrast, some applications have high data coherence, so these applications regularly reuse the existing data in the cache. In this case, the miss rate curve is expected to be uniform around a limit after a short transient state (settling period).
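The settling behaviour described above can be illustrated numerically. Assuming a hypothetical MRC that decays exponentially toward its steady-state limit (a stand-in for measured data, not data from the thesis), the sketch below finds the sample at which the curve enters, and stays within, a tolerance band around that limit:

```python
def settling_index(mrc, limit, tol=0.02):
    """Return the first index after which every sample of the miss-rate
    curve stays within `tol` of its steady-state limit (None if never)."""
    for i in range(len(mrc)):
        if all(abs(m - limit) <= tol for m in mrc[i:]):
            return i
    return None

# Hypothetical high-coherence application: MRC settles quickly toward 0.1.
mrc = [0.1 + 0.5 * (0.5 ** k) for k in range(12)]
print(settling_index(mrc, 0.1))  # -> 5
```

A streaming application would correspond to an `mrc` sequence that keeps fluctuating, for which `settling_index` returns a late index or `None`.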
In such operating environments, different advanced control frameworks are applicable depending on the system model and characteristics. Green and Limebeer [Green and Limebeer, 1995] emphasize the insufficiency of the classical control approach when the plant dynamics are complex and include a significant amount of uncertainty. As a result, advanced control tools such as robust optimal control and adaptive self-tuning control methods are put forward. In our research, two alternative frameworks are investigated: a robust control framework and an adaptive self-tuning control framework. One of the most common robust control methods, H∞ optimal control, was developed in response to modeling errors and uncertainty; its basic philosophy is to optimize worst-case errors [Green and Limebeer, 1995]. In this approach, however, the system dynamics are assumed known and modeled within a predicted error/uncertainty range (worst-case errors).
In contrast to robust control, an adaptive self-tuning control framework provides a higher degree of flexibility in the plant model: it treats the parameters of the plant model as time-varying entities and estimates these parameters from input and output measurement data. Ioannou and Fidan [Ioannou and Fidan, 2006] state that adaptive self-tuning control is a very powerful tool for stabilizing systems whose plant parameters have no known bounds, as well as for estimating disturbances online and canceling their effect via feedback. To some extent, a robust control framework can also address these challenges; however, an upper bound must then be specified by the designer. In contrast, the adaptive self-tuning control framework includes online learning capabilities about the plant dynamics, so the designer need not define any bound or restriction [Ioannou and Fidan, 2006]. However, online learning can track only slowly varying parameters.
The limitations and benefits of these two approaches are discussed in further detail in the literature review chapter.
With regard to the adaptive self-tuning (closed loop) control framework, the most critical challenge is modeling the execution pattern of the thread, including instruction dynamics and cache patterns, on a multi-core multiprocessor platform. In this case, the thread execution pattern can be represented in terms of two dynamics: instruction count dynamics and cache access pattern dynamics. Here, the instruction count dynamics can be considered the overall closed-loop system response; in other words, the actual execution pattern. The L2 cache access pattern model of the thread, on the other hand, can be considered a plant forming part of the overall dynamics. In this respect, the L2 cache access pattern is in fact an unknown constraint on our overall closed-loop system. In developing L2 cache access pattern models, existing cache replacement policies and cache structure are considered. The paper of Suh et al. [Suh et al., 2001], which proposes an analytical cache model for time-shared systems with a fully associative cache, is one of the contributing references at this stage.
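As a minimal illustration of these two coupled dynamics (and explicitly not the model developed in this thesis), the instruction count can be sketched as a discrete-time state driven by the allocated CPU quantum and perturbed by the unknown miss count; the coefficients `IPC_PEAK` and `STALL_PENALTY` are hypothetical:

```python
# Illustrative sketch of instruction-count dynamics; hypothetical
# coefficients, not the thesis's actual state-space model.

IPC_PEAK = 2.0       # instructions retired per cycle with no misses (assumed)
STALL_PENALTY = 100  # cycles lost per L2 miss (assumed)

def step_instruction_count(i_k, quantum_cycles, miss_count):
    """One sample step: i[k+1] = i[k] + IPC_PEAK * max(0, quantum - penalty * misses).
    The miss count acts as the unknown constraint on the closed loop."""
    effective_cycles = max(0, quantum_cycles - STALL_PENALTY * miss_count)
    return i_k + IPC_PEAK * effective_cycles

# Five 500-cycle quanta with one L2 miss each: misses visibly slow progress.
i = 0.0
for _ in range(5):
    i = step_instruction_count(i, quantum_cycles=500, miss_count=1)
print(i)  # 4000.0 instructions retired, instead of the miss-free 5000.0
```

The point of the sketch is only that the instruction count (the closed-loop output) depends on both the controlled input (the quantum) and the cache access pattern (the misses), which is exactly the coupling the adaptive framework must estimate.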
The implementation of any advanced control framework as a solution in our research also requires background knowledge in system theory, statistical estimation techniques, parameter estimation, and basic linear and complex algebra. A brief discussion of these topics is given in the literature review chapter.
1.2 Research Problem
As stated in the previous section, efficient resource allocation, especially in a highly complex processor architecture such as a multi-core chip-level multiprocessor with shared L2 cache memory, is one of the notable challenges for multiprocessor researchers. In general, research has focused on execution slot/processor cycle allocation; in other words, on time-centric scheduling problems. In this thesis, the co-runner cache dependency of the time-centric scheduling of threads has been investigated.
In fact, the main motivation for investigating specifically the shared cache impact on the time-centric scheduling framework is the performance advantage of shared caches, particularly for chip-level multiprocessor architectures with software parallelization, such as multithreading, and hardware parallelization, such as the superscalar core architecture. Namely, a shared L2 cache maximizes cache utilization when a core's execution resources are underutilized, and minimizes data duplication when co-runner threads share the same working data set [Siddha et al., 2007]. The feasibility and advantages of the shared L2 cache design on the IBM S/390 G4 server platform were also verified by empirical performance data obtained in a series of experiments; in particular, a significant improvement in cache hits over a dedicated cache scheme was observed [Mak et al., 1997]. However, shared cache memory management becomes an additional factor affecting a thread's execution performance and the overall performance of the multi-core chip-level multiprocessor. In this case,
there emerges an obvious dilemma between dedicated and shared cache deployment for better overall system performance. In most cases, heterogeneous data access patterns among memory-intensive threads are inevitable, and inefficient cache management under these access patterns results in cache contention, cache misses, and suboptimal performance. A cache miss refers to an unsuccessful attempt by a core to read or write a particular word in the cache memory. As a result, the current write or read operation is delayed until the memory word is retrieved from its main memory location. In other words, a cache miss degrades the overall performance of a multi-core processor in proportion to the memory stall it causes. By definition, cache contention is caused by false sharing: when multiple cores try to update the same cache line, each must become the exclusive owner of the line in turn, which slows down execution [SiliconGraphicsLibrary, 2000]. The impact of contention on performance depends on the thread's data access pattern, the resources shared, and the number of active threads. All in all, shared cache complexity fundamentally changes the time-centric scheduling picture: a fair amount of CPU time allocated to each thread does not necessarily translate into optimal and fair usage of shared resources; as a result, the overall utilization and performance of a multi-core chip multiprocessor might be degraded by cache-related problems such as cache
contention and misses. In line with the main research problem stated above, it is possible to derive more specific research questions directly related to the proposed framework, which addresses the main research problem. In this thesis, our effort is to address this main research problem with a modern control framework; hence, it is also necessary to model the cache-aware time-centric scheduling problem mathematically. Our model considers the execution and cache access patterns of the thread. Due to the amount of uncertainty involved in our model, particularly in the cache access pattern, the adaptive self-tuning control framework is preferred. As a result, the main research problem of investigating co-runner cache dependency in time-centric scheduling is decomposed into more specific technical questions, which helped us to clarify issues and address challenges related to our cache-aware adaptive closed loop scheduling framework. The research questions that emerged throughout our research effort can be summarized into three main categories:
1. Thread Resource Allocation Strategy
(a) How can thread performance bottleneck, which results from shared L2 cache re-
source conflict among threads, be overcome?
(b) What would be the relation between thread execution pattern and cache access
pattern of threads?
2. Adaptive Self-Tuning Control Framework Related Research Questions
(a) Which adaptive self-tuning control strategy is applicable to the thread execution
resource management problem in multi-core multiprocessor architecture in line with
our research goals and scope?
(b) How can a classical adaptive self-tuning control framework be integrated into our
research problem and our closed loop dynamic system model?
(c) Closed Loop Dynamic System Model Related Research Questions
i. How can the L2 cache access pattern and its impact on thread execution performance be formulated, including the selection of optimization criteria, performance metrics, and controlled inputs, in line with our research scope and goals?
ii. Which mathematical strategies can be helpful in the formulation of cache mem-
ory metrics such as cache miss counts and instruction counts, and control inputs
such as CPU quantum?
(d) System Identification Related Research Questions
i. Which dynamic system model parameters are subject to the system identifica-
tion as a part of the adaptive self-tuning control framework?
ii. Which system identification strategies are applicable to the existing adaptive self-tuning control framework?
(e) Algebraic Controller Design Strategy Related Research Questions
i. Which algebraic controller design method is suitable for our adaptive frame-
work?
ii. What would be the desired adaptive closed loop characteristics?
3. Simulation and Experiment-Related Research Questions
(a) What would be the statistical metrics required by the adaptive scheduling frame-
work?
(b) Which experiment/test cases would be more informative about the success of the
system?
(c) What would be the success criteria for the cache-aware adaptive scheduling frame-
work?
The innovative scheduling control framework for the multi-core chip-level multiprocessor computing environment, which challenges existing scheduling frameworks in terms of performance and feasibility, is guided by the questions above and the responses to them.
1.3 Scope and Limitation
Due to the time frame of our research project and the complexity of processor architecture models, the mathematical model of the thread execution pattern considers only specific L2 cache patterns with a limited number of factors. This significantly simplifies the thread execution model.
The abstraction is achieved by not including specific chip-level multiprocessor features as direct constraints in the model, such as simultaneous multithreading (SMT) and L2 cache memory features such as associative cache ways. In fact, the parameter estimation algorithms operating on this more abstract model still consider the system as a whole, with all its features, since estimation is based on the actual input and output data. Hence, modeling the system at a more abstract level does not cause any loss of accuracy.
Although our framework is applicable to any computing application domain, a significant performance improvement over traditional scheduling approaches can only be achieved for applications that exhibit weak temporal cache locality and high cache requirements. For threads with low cache space requirements and strong temporal cache locality, cache performance does not have a significant impact on overall processor performance. In addition, when a co-runner thread has no significant cache demand affecting the cache performance of the target thread, the framework is expected to act as a passive framework.
As for limitations in parameter estimation performance, the accuracy of the parameter estimation also depends on the characteristics of the input and output data; that is, on the cache behavior of the threads. For instance, input measurements lacking coherence result in inaccurate estimates. Hence, an error margin dependent on the cache pattern of the threads is inevitable.
1.4 Contributions
The primary deliverable of the thesis is the adaptive cache-aware closed loop scheduling framework. The adaptive scheduling framework can be considered a real-time adaptive set of algorithms that applies an instruction execution fairness algorithm to threads exhibiting highly unpredictable cache and execution behavior and unanticipated state changes. In this context, the adaptive cache-aware scheduling framework includes two major and three minor contributions: the execution fairness algorithm, the thread execution pattern model, the parameter estimation algorithms, the controller design algorithm, and the design of additional counters for simulation purposes.
Execution Thread Fairness Algorithm In order to reduce a thread's dependency on the resource demands of its co-runner threads, an execution fairness algorithm is developed. This algorithm takes the number of instructions executed in a dedicated resource environment, particularly with dedicated cache resources, as the reference instruction count, and the number of instructions executed in a shared cache environment as the actual instruction count. In this context, the execution fairness algorithm forces the actual instruction count to converge to the reference instruction count by allocating extra execution slots. In this scenario, the cache behavior determines the impact of an allocated execution slot on the actual instruction count.
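The core of the fairness idea can be sketched as a one-shot calculation (the numbers are hypothetical; the thesis realizes this inside a closed loop, where the cache behavior determines how much each extra slot actually contributes):

```python
def fairness_extra_quanta(reference_count, actual_count, instructions_per_quantum):
    """Extra execution slots needed for the actual (shared-cache) instruction
    count to catch up with the dedicated-cache reference count.
    Sketch only: a one-shot calculation, not the thesis's closed-loop algorithm."""
    deficit = reference_count - actual_count
    if deficit <= 0:
        return 0            # thread is not behind its reference
    # ceiling division: a partial quantum still costs a whole slot
    return -(-deficit // instructions_per_quantum)

# Hypothetical numbers: 10,000 reference vs 7,000 actual instructions,
# roughly 800 instructions retired per extra quantum.
print(fairness_extra_quanta(10_000, 7_000, 800))  # -> 4
```

In the actual framework the per-quantum yield is itself uncertain and time varying, which is why the allocation is driven by feedback rather than by a fixed divisor like `instructions_per_quantum`.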
Thread Execution Pattern Model To achieve a real-time adaptive system response to time-varying resource demands, a thread execution pattern model is developed. This model considers the instruction count and cache resource pattern of the thread, and provides state space equations that allow us to construct a time-series regression model for our parameter estimation algorithms. This model can be considered an innovation due to the additional thread-level granularity and cache awareness included in it.
Parameter Estimation Algorithm A QR Recursive Least Squares (QR-RLS) algorithm is utilized in our real-time adaptive cache-aware scheduling framework. The QR-RLS algorithm estimates cache resource patterns using time-series regression statistics of miss counts and co-runner miss counts. The actual contribution is the development of a regression model based on the time-series miss count statistics. Applying a real-time deterministic estimation algorithm can also be considered an innovative approach, since conventional scheduling algorithms and resource allocation frameworks generally prefer probabilistic strategies to predict the state of the system. As a result, in contrast to probabilistic approaches, our framework is able to track faster pattern transitions.
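The recursion itself can be sketched with the standard (non-QR) RLS update; the QR form computes the same estimate via an orthogonal factorization for better numerical behavior. The regressor, the "true" coefficients, and the forgetting factor below are purely illustrative:

```python
import numpy as np

def rls_update(theta, P, x, y, lam=0.99):
    """One recursive least squares step with forgetting factor lam:
    update parameter estimate theta and covariance P from regressor x
    and measurement y. (Standard RLS shown for brevity; the thesis uses
    the numerically more robust QR-RLS form.)"""
    x = x.reshape(-1, 1)
    gain = P @ x / (lam + float(x.T @ P @ x))
    error = y - float(x.T @ theta)        # one-step prediction error
    theta = theta + gain * error
    P = (P - gain @ x.T @ P) / lam
    return theta, P

# Recover a hypothetical linear miss-pattern relation y = 0.6*u1 - 0.3*u2.
rng = np.random.default_rng(0)
theta = np.zeros((2, 1))
P = np.eye(2) * 1000.0          # large initial covariance: no prior knowledge
for _ in range(200):
    x = rng.normal(size=2)
    y = 0.6 * x[0] - 0.3 * x[1]
    theta, P = rls_update(theta, P, x, y)
print(theta.ravel())            # close to [0.6, -0.3]
```

The forgetting factor `lam` is what lets the estimator track time-varying (but slowly varying) coefficients, at the cost of higher estimate variance.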
Controller Design Algorithm In order to track the reference instruction count irrespective of execution patterns and cache behavior, an algebraic controller design algorithm that can redesign the controller based on the system state is employed. In particular, a pole placement algorithm is selected for this purpose. The main contribution is the formulation of the controller parameters in terms of the execution pattern plant coefficients, which are in fact estimated parameters.
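For a first-order plant (much simpler than the thesis's model), the pole placement calculation reduces to a single gain formula, sketched here with hypothetical estimated coefficients:

```python
def pole_placement_gain(a, b, desired_pole):
    """For the first-order plant y[k+1] = a*y[k] + b*u[k] under proportional
    control u[k] = K*(r - y[k]), the closed loop is
    y[k+1] = (a - b*K)*y[k] + b*K*r, so placing the pole at p gives
    K = (a - p) / b."""
    return (a - desired_pole) / b

# Hypothetical estimated plant coefficients, as a self-tuning loop would
# re-supply them every sample:
a_hat, b_hat = 0.9, 0.5
K = pole_placement_gain(a_hat, b_hat, desired_pole=0.4)   # K = 1.0
y, r = 0.0, 1.0
for _ in range(20):
    y = a_hat * y + b_hat * K * (r - y)
print(round(y, 3))  # -> 0.833, the steady state b*K*r / (1 - a + b*K)
```

Note that pure proportional placement leaves a steady-state offset from the reference; the thesis's algebraic design, working from the estimated plant coefficients each sample, would also shape the tracking behavior, not just the pole location.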
Additional Counters for M-Sim In addition to the actual framework, small counters and software modules are added to the open-source M-SIM platform in order to collect time-series regression statistics. More specifically, a software module is developed that collects cache hit/miss counts and instruction counts over a sample period of 0.005 sec, or 500 cycles. As a result, the time-series statistics used in the parameter estimation algorithm are retrieved.
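The kind of counter module described above can be sketched as follows; only the 500-cycle sample window comes from the text, while the interface and numbers are assumed:

```python
class SampleCounters:
    """Sketch of a sampling counter module of the sort one could add to a
    simulator such as M-SIM: accumulate hit/miss and instruction counts
    and emit one time-series record per 500-cycle sample window.
    The interface is hypothetical, not M-SIM's actual API."""
    SAMPLE_CYCLES = 500

    def __init__(self):
        self.cycle = 0
        self.hits = self.misses = self.instructions = 0
        self.series = []        # one (hits, misses, instructions) per window

    def tick(self, hits=0, misses=0, retired=0):
        """Called once per simulated cycle with that cycle's event counts."""
        self.hits += hits
        self.misses += misses
        self.instructions += retired
        self.cycle += 1
        if self.cycle % self.SAMPLE_CYCLES == 0:
            self.series.append((self.hits, self.misses, self.instructions))
            self.hits = self.misses = self.instructions = 0

c = SampleCounters()
for i in range(1000):                     # 1000 cycles -> 2 sample windows
    c.tick(hits=1, misses=int(i % 100 == 0), retired=2)
print(c.series)  # -> [(500, 5, 1000), (500, 5, 1000)]
```

Each emitted tuple is one sample of the time series consumed by the parameter estimation algorithm.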
1.5 Thesis Outline
This thesis is divided into seven chapters. Following the introduction, Chapter 2 provides a
preliminary literature review on multiprocessor scheduling algorithms, multi-core multiproces-
sor architecture and adaptive self-tuning control frameworks. Chapter 3 provides an initial
mathematical model of thread execution patterns for the adaptive self-tuning control framework.
In Chapter 4, parameter estimation strategies for the cache access pattern are developed. Based on the mathematical models, a QR Recursive Least Squares (RLS) algorithm is implemented to capture the cache miss pattern of the thread. Chapter 5 investigates the algebraic controller design algorithms applicable to adaptive self-tuning control frameworks. In particular, the pole placement method is investigated and utilized to iteratively design the controller parameters. Chapter 6 concentrates on simulation of the cache-aware adaptive closed loop scheduling framework and interpretation of the simulation outcomes. Chapter 7 concludes the thesis and indicates future directions.
Chapter 2
Literature Review
In this chapter, the literature in the relevant research field is reviewed. In line with our topic,
adaptive cache-aware processor resource allocation, the literature in multi-core multiprocessor
system architecture, multiprocessor scheduling theory, and modern control theory has been
examined. Relevant findings have been summarized in the following sections.
2.1 Multi-Core Chip Level Multiprocessor System Architecture
A multi-core chip-level multiprocessor (CMP) system refers to a system composed of two or more cores integrated on a single chip. Although this on-chip architecture significantly increases overall computing capability, multi-core CMP systems still have shortcomings in optimally managing resources among cores. In other words, how on-chip resources are shared among cores is the actual challenge in these systems, and it affects overall system performance. As a consequence, processor designers endeavor to propose different mechanisms and design options to optimize resource management at the chip level. These new design approaches are implemented at three hardware levels: (1) core structure, (2) memory architecture, and (3) core diversity at the chip level.
2.1.1 Core Structure
As the atomic element of a multi-core system, a core is an independent execution unit whose performance and functionality shape the performance and functionality of the whole system. As an independent execution unit, each core has Level 1 (L1) data and instruction caches, an ALU, registers, and other hardware units. By definition, a core is a unit that reads and executes program instructions. For the smooth operation of a program, instructions should be processed rapidly; however, a simple core can only process one instruction at a time. To improve instruction throughput, core designers take advantage of parallelism.
Recent developments in parallelism have introduced new concepts such as the superscalar processor architecture. The superscalar architecture was introduced to improve instruction-level parallelism beyond that provided by traditional instruction pipelines and branch predictors. According to Olukotun [Olukotun, 2007], superscalar processors were developed to execute multiple instructions from a single instruction stream on each cycle. In other words, multiple instructions can be executed by the processor in a pipe stage. This is achieved by dynamically identifying, on each cycle, a set of instructions from the instruction stream capable of parallel execution, and executing them.
Despite the superscalar architecture's capability of issuing multiple instructions on each cycle, the instructions in the issue slots are limited to a single process: while a specific process is dedicated to a core, no instruction belonging to another process can be fetched into any of the execution slots. Hence, significant slot waste occurs when instructions stall while waiting for processor resources. In this context, multithreading was introduced to diminish the waste of execution slots in the superscalar pipeline architecture. Multithreading is thread-level parallelism at the core level. As discussed above, an n-way superscalar processor has n instruction issue slots per cycle, and each slot holds a single instruction to be executed in a pipe stage in a single cycle; thus, a maximum of n instructions can be executed in a pipe stage on each cycle. If the core supports only a single thread, then all execution slots belong to a single process for a dedicated time, and some execution slots in pipeline stages are wasted due to stalls caused by memory access latencies or branch prediction failures. As a result, there is both horizontal and vertical waste. Horizontal waste occurs when some of the n execution slots (n-way superscalar) are not used in a single execution cycle. Vertical waste refers to the case where an execution cycle goes completely unused, so all n slots (n-way superscalar) are wasted [Thimmannagari, 2008]. Thimmannagari defines four different multithreading techniques: vertical threading, simultaneous multithreading (SMT), branch threading, and power threading [Thimmannagari, 2008]. In vertical threading (VT), only instructions from one particular thread occupy a given pipe stage; in other words, multiple threads share the superscalar processing resources in aggregate, but not in the same execution cycle. Thus, this type of multithreading brings a complete solution to vertical slot waste. Simultaneous multithreading (SMT), by contrast, permits independent threads to issue instructions to the superscalar's functional units in a single cycle; hence, SMT addresses both vertical and horizontal waste of execution slots. Despite small differences in efficiency, both SMT and VT aim to minimize slot waste in a superscalar processing architecture. For instance, as shown in Figure 2.1, Thread 0 experiences a cache miss in the fifth time slot of the thread issue slots given in Figure 2.1b; in this case, Thread 1 takes over the execution slot. Hence, the memory stall does not lead to any execution time slot waste, and only stalls that particular thread, Thread 0. Branch threading and power threading, by contrast, address branch prediction and power-related issues, respectively. While branch threading switches threads on hitting a branch, to avoid the branch penalty in processors that do not support dynamic or static branch prediction, power threading switches threads based on the power dissipated by a particular thread, to keep the average power dissipated within specifications [Thimmannagari, 2008].
Figure 2.1: Instruction Issue and Cache Miss for a Single-Threaded Processor and a 2-Threaded Processor Supporting SMT [Thimmannagari, 2008]
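The two kinds of waste defined above can be counted directly from an issue-slot trace; the 4-way trace below is purely illustrative:

```python
def slot_waste(issue_matrix):
    """Classify wasted issue slots in an n-way superscalar trace.
    Each row is one cycle; 1 = slot filled, 0 = slot empty.
    A fully empty row is vertical waste; empty slots in a partially
    filled row are horizontal waste."""
    horizontal = vertical = 0
    for cycle in issue_matrix:
        empty = cycle.count(0)
        if empty == len(cycle):
            vertical += empty      # whole cycle unused
        else:
            horizontal += empty    # some slots unused
    return horizontal, vertical

# 4-way superscalar, 3 cycles: one full, one partial, one stalled cycle.
trace = [[1, 1, 1, 1],
         [1, 1, 0, 0],
         [0, 0, 0, 0]]
print(slot_waste(trace))  # -> (2, 4)
```

In these terms, VT eliminates only rows of zeros (vertical waste), whereas SMT can also fill the empty slots of partially used cycles (horizontal waste).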
In this context, our core structure uses instruction-level parallelism with an instruction pipeline on a superscalar core architecture; in addition, SMT is supported by our core architecture. For instance, if our core structure, as in the IBM POWER5 core, is an 8-way superscalar architecture, then instructions from eight threads can be issued per execution cycle [Vetter et al., 2006].
2.1.2 Memory Architecture
Memory architecture on a multi-core CMP platform refers to the on-chip memory architecture, the so-called on-chip cache architecture. Because emerging applications depend heavily on streaming and shared data sets, the on-chip cache architecture is significant for multi-core CMP system efficiency and utilization. That is, off-chip data communication between cores and main memory causes significant stalls, while a well-designed on-chip cache can minimize off-chip data communication substantially. As a result, research efforts in cache architecture have put forward innovative solutions such as multi-level cache structures and additional accessibility capabilities, as in shared caches. For instance, the Intel Itanium 2 series multi-core processors target maximum reliability and minimal cache errors for enterprise-level servers; they are developed with a three-level private cache memory architecture for each core to minimize cache instability and errors [Intel, 2006]. However, dedicated or private caches can rarely be utilized at their full capacity, which leads to a waste of available resources. Namely, a core's dedicated cache is underutilized if the threads running on the core do not issue enough cache memory requests, and over-utilized when the threads request more cache memory than the dedicated cache provides. Hence, most multi-core multiprocessor architectures, including the AMD Opteron, AMD Athlon, and Intel Pentium D, have one level of private cache, the Level 1 (L1) cache, and a shared Level 2 (L2) cache [Pfenning and Barbic, 2007]. Therefore, providing additional accessibility features, as in shared caches, at some level of the cache architecture is a popular strategy in cache architecture design.
In addition to the accessibility characteristics of a cache, the memory access characteristics of shared caches are another memory architecture design option. According to Kent and Williams, a uniform memory access (UMA) architecture refers to a shared memory structure in which all locations have the same access characteristics, including access times [Kent and Williams, 1997]; whereas in a Non-uniform Memory Access (NUMA) architecture, memory access characteristics, including access times, can depend on the location relative to the core [Tanenbaum, 2005]; for instance, a high-capacity cache can be deployed close to a core with high data streams. Our research is limited to uniform memory access shared cache architectures; NUMA shared cache architectures are generally deployed in heterogeneous core architectures. Further discussion of cache architecture is conducted in the cache architecture and policies sections.
2.1.3 Core Diversity and Parallelism
Core diversity refers to the homogeneity of the cores on a multi-core platform. In this regard, there are two different multi-core processing system architectures. The first is the homogeneous multi-core architecture, in which each core has the same specification and role; the second is the heterogeneous architecture, in which cores differ. Multiprocessing systems that treat each core equally are called symmetric multiprocessing (SMP) systems, and cores having the same specifications and features are called homogeneous cores. For the sake of simplicity, our scope is limited to symmetric multiprocessing systems and homogeneous core architectures.
As for parallelism in homogeneous cores, homogeneous multi-core processor platforms can support multiple levels of parallelization; depending on the application domain, each core may implement architectures such as superscalar (instruction-level parallelism), VLIW (instruction-level parallelism), SIMD (data-level parallelism), vector processing (data-level parallelism), or multithreading (task-level parallelism). In other words, homogeneous multiprocessor platforms can implement low-level instruction parallelism with superscalar and pipelining architectures, task (thread) level parallelism with multithreading or multitasking, and data-level parallelism with SIMD or vector processing architectures. For example, the IBM POWER5 dual-core processor is an 8-way superscalar architecture (8 threads per core) and supports simultaneous multithreading [Vetter et al., 2006]. The SUN T1 8-core processor is a 4-way pipeline architecture (4 threads per core) with fine-grained multithreading [Microsystems, 2006].
In our research, our homogeneous multi-core chip-level multiprocessor system takes advantage of instruction parallelism with instruction pipelines on an 8-way superscalar architecture, and thread-level parallelism with simultaneous multithreading (SMT) technology. As a result, only task (thread) level parallelism is considered in detail, and the cache-related thread scheduling challenges are addressed. In this respect, our research scope is limited to optimal thread scheduling at the multi-core chip processor level, which minimizes stalls due to cache misses. Although recent research on thread scheduling has addressed performance issues with thread switching methods, our research goal is to put forward an alternative approach that optimizes thread-level scheduling.
2.2 Cache Architecture and Policies
A computer's memory system has a hierarchical structure, from the highest-level memory unit (processor registers) to the lowest-level unit (offline storage). Each tier in this architecture has different characteristics; for instance, going down the memory hierarchy, one observes decreasing cost per bit, increasing capacity, and slower access times. Each memory system unit can be characterized by the following key characteristics [Stallings, 2003]:
1. Location: Processor, Internal(Main), External(Secondary)
2. Capacity: Word size, number of words
3. Unit of Transfer: Word, Block
4. Access Method: Sequential, Direct, Random, Associative
5. Performance: Access time, Cycle time, Transfer rate
6. Physical Type: Semiconductor, Magnetic, Optical, Magneto-optical
7. Physical Characteristics: Volatile/nonvolatile, Erasable/nonerasable
8. Organization
The main motivation for creating such a hierarchical structure is to boost memory access speed at an optimal architectural cost and efficiency. Hence, data is duplicated or partitioned among the different layers so that the processor's access time to data in the memory architecture is minimized. Throughout the discussion of issues within the memory hierarchy, three fundamental concepts play a critical role [Stacpoole and Jamil, 2000]:
Inclusion The inclusion property indicates that all information elements are originally stored in the outermost level (e.g. disk), Ln, of the memory hierarchy, and that subsets of the original data and instructions are moved up the hierarchy towards the processor registers, L0, during the execution of a process. Conversely, as data in L0 ages, it moves back down to the outermost level Ln. In addition, inclusion also covers the methods and units of data transfer between two levels of the hierarchy.
Coherence The coherence property implies that copies of the same information or data elements must remain consistent throughout the memory hierarchy.
Locality The locality property refers to the clustering of the behavioral characteristics of the data and memory demands of processes within certain regions of time and space. The memory hierarchy is designed around these behavioral characteristics of the CPU. There are two kinds of locality principles:
Spatial Locality Memory locations numerically close to a recently accessed location, in the same memory block, are likely to be accessed in the near future [Tanenbaum, 2005].
Temporal Locality If a particular memory location is accessed at one point in time, it is probable that the same location will be accessed again in the near future.
Based on these principles, a detailed discussion of cache architecture and its design elements is conducted in the following section.
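The two locality principles can be demonstrated with a toy direct-mapped cache simulation; the cache geometry and access patterns below are arbitrary assumptions chosen only to make the effect visible:

```python
def hit_rate(addresses, cache_lines=8, block_words=4):
    """Toy direct-mapped cache: sequential access (spatial locality) and
    repeated access (temporal locality) hit far more often than a
    scattered pattern. Sizes are illustrative assumptions."""
    cache = [None] * cache_lines          # block tag stored per line
    hits = 0
    for addr in addresses:
        block = addr // block_words       # which memory block
        line = block % cache_lines        # which cache line it maps to
        if cache[line] == block:
            hits += 1
        else:
            cache[line] = block           # miss: fetch block into the line
    return hits / len(addresses)

sequential = list(range(64))              # strong spatial locality
repeated   = [0, 1, 2, 3] * 16            # strong temporal locality
scattered  = [i * 37 for i in range(64)]  # poor locality
print(hit_rate(sequential), hit_rate(repeated), hit_rate(scattered))
# -> 0.75 0.984375 0.0
```

The hierarchy pays off precisely because real workloads look more like the first two patterns than the third.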
2.2.1 Cache Architecture
In such a hierarchy, cache memory is a high-speed semiconductor volatile memory used by a computer processor for temporary storage of information [Anthes, 2000]. Stallings [Stallings, 2003] defines the elements of a cache architecture as:
• Cache Size refers to the physical memory units allocated to the cache.
• Line Size refers to the addressable cache memory unit by the processor.
• Number of Caches: Single or two level/ Unified or Split indicates the number of levels
in the cache memory architecture of the processing platform.
• Mapping Function: Direct/Associative/Set-Associative refers to how actual physical memory blocks are mapped into cache lines. For instance, Figure 2.2 demonstrates a 4-way set-associative cache architecture. In this figure, the cache contains a copy of portions of main memory, the so-called memory blocks shown in Figure 2.2. The main memory is divided into K-word memory blocks. Set-associative mapping maps each memory block to a specific set of cache lines, within which the block may occupy any line; associative mapping maps an individual memory block to any cache line; and direct mapping is a static mapping between cache lines and memory blocks.
• Replacement Algorithm: Least Recently Used (LRU)/ Most Recently Used (MRU)/ First In First Out (FIFO)/ Least Frequently Used (LFU)/ Random/ Adaptive & Dynamic cache replacement algorithms signify the policy that determines which existing cache line is replaced when a new data block arrives from physical memory.
• Write Policy: Write Through/ Write Back/ Write Once is the policy that decides how updates to data in the cache memory are written back to physical memory.
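The mapping function can be made concrete by decomposing an address; the line size and set count below are hypothetical, chosen only for illustration:

```python
def decompose_address(addr, line_bytes=64, num_sets=128):
    """Split an address into (tag, set index, block offset) the way a
    set-associative cache does; a 64-byte line and 128 sets are assumed
    purely for illustration. In a 4-way set-associative cache, the block
    may then occupy any of the 4 lines (ways) of its set."""
    offset = addr % line_bytes        # byte within the cache line
    block = addr // line_bytes        # which memory block
    set_index = block % num_sets      # which set the block maps to
    tag = block // num_sets           # identifies the block within its set
    return tag, set_index, offset

print(decompose_address(0x12345))  # -> (9, 13, 5)
```

With direct mapping, `num_sets` equals the number of lines and each set has one way; with fully associative mapping there is a single set and the tag alone identifies the block.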
As mentioned previously, the main design objective of cache memory is to speed up memory access times based on the locality of data patterns. Hence, these elements in fact concern how physical data from memory are mapped into the cache and how fast the requested data can be served to the thread.
Figure 2.2: 4-Way Set Associative Cache and Main Memory
Based on these elements, different cache architectures can be constructed, and each unique cache design might perform well or poorly depending on the thread cache access pattern. As a result, it can be concluded that there is no single best cache design. That is the reason we consider cache-related issues inevitable and try to compensate for cache-related performance degradation using other resources.
2.2.2 Cache Performance Indicator: Cache Miss
In our project context, the cache architecture is characterized by a single metric that indicates the cache performance for a particular thread. In this regard, one of the most widely used cache performance indicators, the cache miss, is used. A cache miss refers to the situation in which a requested block is not found in the cache and has to be fetched from its original storage location or from lower-level caches. In fact, a cache miss is not only an indicator of the correctness of a design decision but also an indicator of the data access behavior of the applications running on a processor. Put differently, the cache miss measures the cache architecture's response to the applications' computational resource demands; hence, it is an unpredictable and dynamic indicator to be reasoned about for better system performance. In this regard, the applications' resource patterns as well as the cache architecture characteristics should be considered. According to its source, a cache miss can be categorized into three types [Stacpoole and Jamil, 2000]:
Compulsory Miss A compulsory cache miss occurs when the very first access to the requested block fails; in this case the requested block is retrieved into the cache from the main memory. This is also called a cold start or first reference miss. It is very hard to prevent this type of miss due to the unpredictability of application resource demands and data patterns; for such unpredictable patterns, the only solution might be speculative loading of data into the cache before the data is required. A larger cache block size can reduce compulsory misses, to an extent limited by the locality property of the applications' data access patterns.
Capacity Miss A capacity cache miss occurs when the processor requests a cache block which has already been discarded. A capacity miss usually happens throughout the execution of large programs or processes, since it is impossible for the cache to hold all the blocks used or needed by such a large process. In the prevention of capacity misses, the replacement policies as well as the cache size play an important role.
Conflict Miss Two memory blocks corresponding to the same cache line number may be accessed repeatedly, each one replacing the other in turn. This process is called thrashing. In the thrashing process, requesting a new block causes another block to be discarded [Stacpoole and Jamil, 2000]. A conflict miss occurs when the cache request for a new block fails because the corresponding cache line is occupied by another memory block. Conflict misses take place only in direct-mapped or set-associative caches. Conflict misses increase proportionally to the number of memory blocks mapped into the same cache line number, and a larger block size also increases conflict misses. Fully associative mapping prevents conflict misses, but at a significant implementation cost and with slower cache access times.
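The three miss classes can be distinguished concretely by replaying a block access trace against both the real cache and a fully associative LRU "shadow" cache of the same size: a non-first-reference miss that the shadow cache would have absorbed is a conflict miss, otherwise it is a capacity miss. The following sketch (a direct-mapped cache holding one block per line; the trace, cache size, and classification rule are illustrative assumptions) applies this idea:

```python
from collections import OrderedDict

def classify_misses(block_trace, num_lines=4):
    """Classify misses of a direct-mapped cache with `num_lines` lines
    (one block per line) into compulsory, conflict and capacity misses.

    A miss is compulsory on the first reference to a block; otherwise it
    is a conflict miss if a fully associative LRU cache of equal size
    would have hit, and a capacity miss if even full associativity
    would not have helped.  (Sketch; real classifiers vary in detail.)
    """
    direct = {}                      # line index -> resident block
    full = OrderedDict()             # fully associative LRU shadow cache
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "conflict": 0, "capacity": 0}
    for block in block_trace:
        line = block % num_lines
        hit_direct = direct.get(line) == block
        hit_full = block in full
        # keep the LRU shadow cache up to date
        if hit_full:
            full.move_to_end(block)
        else:
            if len(full) == num_lines:
                full.popitem(last=False)   # evict least recently used
            full[block] = True
        if hit_direct:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif hit_full:
            counts["conflict"] += 1
        else:
            counts["capacity"] += 1
        seen.add(block)
        direct[line] = block
    return counts

# Blocks 0 and 4 thrash the same line of a 4-line direct-mapped cache:
# 2 compulsory misses, then 4 conflict misses, no capacity misses.
print(classify_misses([0, 4, 0, 4, 0, 4]))
```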
As suggested, the impact of cache design elements on cache miss reduction is limited by the data locality characteristics of the applications and processes. For instance, compulsory misses are unpredictable at the cache level before the actual request takes place, so those misses can only be addressed by dynamic and stochastic cache management strategies, as well as application data pattern prediction algorithms. In our thesis, cache misses are used as a metric involved in the estimation of cache access patterns and of the stalls they cause in overall processor performance.
2.3 Multiprocessor Scheduling
Scheduling is the act of assigning resources to activities or tasks. Scheduling problems include a set of constraints, such as deadlines, to be met by any schedule. For any particular set of constraints, two problems should be addressed. The first is the decision problem, which determines whether a given instance is feasible (schedulable) or not; the second is the scheduling problem, which builds the schedule for a given feasible instance. Hence, any scheduling strategy considering a particular scheduling problem can be partitioned into two sub-problems: the decision problem and the scheduling problem [Baruah et al., 1996]. While the decision problem is used in complexity analysis, the scheduling problem derives the actual schedule for a given instance; hence, the optimality of the solution depends entirely on the scheduling problem. At this point, the actual difference between multiprocessor and uniprocessor systems lies in the fact that feasibility does not guarantee optimality of the solution on a multiprocessor system, while it does on a uniprocessor system.
Despite the fact that emerging multiprocessing systems are an important innovation for the computational capabilities of computing systems, multiprocessor systems also bring an important challenge in extending existing uniprocessor real-time scheduling algorithms to the multiprocessing environment. In our research, our main interest is concentrated on the multi-core chip level multiprocessor (CMP). One of the main characteristics of such systems is the existence of shared memory and a common base of time [Cottet et al., 2002]. In this way, each processor (core) has a global view of every other processor (core) at any instant of time. In such a processing environment, a scheduling algorithm is valid if and only if all task deadlines are met. As a result, despite strong analogies between strongly coupled multiprocessing systems and centralized (uniprocessor) systems, Cottet et al. [Cottet et al., 2002] underline the major incapability of the traditional centralized approach. According to Cottet et al. [Cottet et al., 2002], traditional real-time scheduling cannot be optimal in such multiprocessing environments. As a proof, they illustrate the failure of EDF to satisfy optimality in a multiprocessing environment. Hence, the traditional real-time scheduling approach on the uniprocessor platform should be adapted to the multiprocessing environment.
Throughout this section, a multiprocessor scheduling taxonomy is introduced with the relevant discussion of our multi-core chip multiprocessing (CMP) scheduling problem. Then, traditional real-time multi-core CMP scheduling is discussed. Before the discussion of cache-aware scheduling, resource estimation and access pattern analysis are addressed in detail. This is followed by the discussion of cache-aware scheduling, covering cache-fair thread scheduling and then pattern-aware cache scheduling (replacement) in detail.
2.3.1 Multiprocessor Scheduling Taxonomy
The multiprocessor scheduling problem refers to the problem of pre-emptively scheduling a real-time task set on a symmetric multiprocessor (SMP) consisting of a number of cores. According to Bertogna et al. [Bertogna et al., 2009], this problem can be tackled in two different ways: (1) by partitioning tasks to processors, or (2) by a global scheduler. The partitioning method, analogous to the bin packing problem, is in fact NP-hard. However, for a fixed task set with known priorities, the problem can be reduced to a number of NP-easy problems after the initial partitioning of the tasks. In this case, this method provides a simple and efficient approach. Otherwise, for an unpredictable or unknown processing environment, either online partitioning algorithms or additional load balancing algorithms should be used.
As an alternative to the partitioning approach, global scheduling has a single system-wide queue of ready tasks; tasks in this queue are scheduled on available computing resources [Bertogna et al., 2009]. In contrast to the partitioning approach, different instances of a task can be allocated to different cores, and dynamic runtime workload changes can be addressed, since the schedulability metric, the actual workload of the system, can be retrieved at each instant; so, any instantaneous workload change will be immediately reflected in the scheduling decision. In addition, if task migration is allowed, real-time load balancing can also be performed over the whole system. In that sense, global scheduling is more appropriate for systems with time-varying, unpredictable workloads [Bertogna et al., 2009]. Considering this fact, our scheduling framework will be a global scheduling algorithm rather than a partitioning algorithm.
In the multi-core chip level multiprocessor (CMP) system, a shared cache as well as dedicated caches per core are deployed on the chip; in addition, each core is capable of handling multiple threads simultaneously, so-called multithreading. Hence, in our scheduling strategy, it is possible to determine dynamic priorities at the thread level. A value-based scheduling policy, which considers the scheduling problem as an optimization problem with additional constraints and parameters such as cache allocations and multithreading, is used to determine the dynamic priority of each task. In other words, the set of coefficients which gives an optimal solution to our scheduling optimization problem indicates the dynamic priorities of the tasks.
Although migrating threads among cores is an option, the necessity of migration on a platform where resource sharing is implemented is a main concern. That is, migrating a thread to another core requires a thread context switch, which moves the thread stack, including pointers and the L1 instruction and data cache states, to the new core. All these operations cause a delay due to the transfer of information and state among cores. Despite these processing overheads and stalls, thread migration is critical in some multi-core architectures, such as the NUMA (Non-Uniform Memory Access) architecture, in which cores have faster access to their local cache than to shared memory, and in heterogeneous multi-core systems, where each core is designed for a specific purpose with specifications such as cache memory capacity. In such environments, the initial thread placement will be based on the data locality of the thread. This can cause specific cores to be overloaded due to poor initial placement, and in some cases quick load variations can lead to poor utilization of a core [Itzkovitz et al., 1998]. In these conditions, thread migration makes it possible to continue the execution of such threads.
In our cache aware adaptive closed loop scheduling framework, thread migration is ignored because the stability of each core is very critical to our framework. That is, our closed loop framework only addresses the steady state dynamics; thread migration would cause instability in our closed loop framework, and this is not desirable for our system.
2.3.2 Real-Time Multi-Core Multiprocessor Scheduling Algorithms
The real-time scheduling algorithms designed for the uniprocessor platform fail to provide optimality on multiprocessing platforms. In other words, a scheduling algorithm allocating 'm' tasks onto 'n' processors or cores may not realize an optimal allocation even if it achieves a feasible (schedulable) one.
That is, although most traditional scheduling algorithms such as Earliest Deadline First (EDF) and Least Laxity First (LLF) remain feasible on multiprocessor systems [Cottet et al., 2002], the actual problem is their failure to provide an optimal schedule rather than a feasible one. As a result, all these algorithms are inadequate for multiprocessing platforms.
In this regard, the Pfair class of global scheduling algorithms is one of the major approaches that provides an optimal solution to multiprocessor scheduling problems. The Pfair algorithm, proposed by Baruah et al., manages to optimally solve, in polynomial time, the multiprocessor scheduling problem, which was previously believed to be NP-hard [Anderson and Srinivasan, 1999]. The scheduling strategy of Baruah et al. considers proportionate progress: each task is scheduled resources in proportion to its weight. This is called proportionate fairness [Baruah et al., 1996].
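Proportionate fairness is commonly formalized through the notion of lag: with task weight w_T (its execution requirement over its period), a schedule is Pfair iff lag(T, t) = w_T × t − allocated(T, t) stays strictly between −1 and 1 for every task and time [Baruah et al., 1996]. A minimal checker for this condition over a slot-based schedule might look like the following (the slot/set schedule encoding is an assumed representation, not Baruah et al.'s notation):

```python
def is_pfair(weights, schedule):
    """Check the proportionate-fairness condition for a slot-based schedule.

    `weights[i]` is task i's weight e_i/p_i; `schedule[t]` is the set of
    task indices that receive a processor in slot t.  The schedule is
    Pfair iff, for every task i and time t, the lag
        lag(i, t) = weights[i] * t - allocated(i, t)
    stays strictly between -1 and 1.
    """
    allocated = [0] * len(weights)
    for t, running in enumerate(schedule, start=1):
        for i in running:
            allocated[i] += 1
        for i, w in enumerate(weights):
            lag = w * t - allocated[i]
            if not -1 < lag < 1:   # lag drifted a full quantum: not Pfair
                return False
    return True

# Two tasks of weight 1/2 alternating on one processor are Pfair;
# running one of them for two consecutive slots first is not.
print(is_pfair([0.5, 0.5], [{0}, {1}, {0}, {1}]))   # True
print(is_pfair([0.5, 0.5], [{0}, {0}, {1}, {1}]))   # False
```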
To reduce the worst-case computational complexity of the PF algorithm, Baruah proposes another algorithm, PD scheduling, with a complexity of O(min(m log n, n)), which replaces the tie-break procedure of the PF algorithm with a more efficient one. According to the new tie-break procedure, two new categories of tasks are defined: heavy and light tasks. In addition, the PD algorithm associates a pseudo-deadline with each allocation of every light task and each non-allocation of every heavy task. Moreover, the state of the PD algorithm is characterized by a tableau keeping count of the number of associated pseudo-deadlines satisfied up to the current point for each future slot and weight class [Baruah et al., 1995]. Despite the computational and algorithmic efficiency of the PD algorithm, its complexity encouraged other researchers to develop simpler algorithms.
Anderson and Srinivasan [Anderson and Srinivasan, 1999] manage to simplify the PD algorithm further, to a version which includes one intermediate deadline and two tie-break parameters, and they prove that these two tie-break parameters are necessary for a feasible task allocation. Moreover, Anderson and Srinivasan reveal that consecutive window overlaps of subtasks do not occur only at job boundaries. In other words, the PD and PF algorithms use the scheduling decision given at each release of a task T, the so-called job of T, as the priority for all subtasks of that job, i.e. throughout the task period. Hence, there is no change in PF priorities of tasks during the task interval; however, this might create inefficiency. Considering this possibility, Anderson and Srinivasan [Anderson and Srinivasan, 1999] introduce the swapping argument, which discusses systematically interchanging pairs of subtasks, on the basis of the priority definition, during the task period. The feasibility and correctness of their swapping argument and of their simplified priority scheme are also proved in their article in detail; simulation of their simplified and revised algorithm is performed on a dual core environment. In my opinion, the most innovative element in their work is the systematic interchanging of subtask pairs throughout the task period.
The second paper of Anderson and Srinivasan [Anderson and Srinivasan, 2000] addresses the work-conserving problem of the PD algorithm. In Pfair scheduling, each subtask of a task T must be scheduled within a "window" of time slots, and the last slot of this window represents the deadline of that subtask. However, if some subtask of T is completed "early" within its window, then T is ineligible for execution until the beginning of the window of its next subtask [Anderson and Srinivasan, 2000]. This is against the work-conserving principle: a scheduler is work-conserving if and only if it never leaves a processor idle while there exists an uncompleted job in the system that could be executed on that processor. Anderson and Srinivasan [Anderson and Srinivasan, 2000] introduce the ERfair scheduling algorithm, which alters the Pfair algorithm in such a way that if two subtasks are part of the same job, then the second subtask becomes eligible for execution as soon as the first completes.
In this way, the early release version of PD will have several advantages over PD. First of all,
average job response time is improved, especially in lightly loaded systems. Moreover, the runtime cost is decreased, since the bookkeeping information required by PD regarding task eligibility is not needed by the Early Release PD algorithm (ER-PD) [Anderson and Srinivasan, 2000]. In summary, Anderson and Srinivasan [Anderson and Srinivasan, 2000] contribute to multiprocessor scheduling the work-conserving scheduling algorithm ER-PD, which can optimally schedule periodic tasks on a multiprocessor system in polynomial time.
Despite the efficiency and optimality of the real-time multiprocessor scheduling algorithms in allocating tasks to multiple cores or processors, the main goal of these algorithms is to optimally share processor time-related resources. In other words, these algorithms are time-centric scheduling algorithms, and their capabilities are limited to time-related resources. They fail to consider factors such as cache memory resources, which have a significant impact on scheduling efficiency. To address this, cache-aware scheduling algorithms, which add cache awareness to the existing time-centric real-time multiprocessor scheduling algorithms, have been developed.
2.4 Cache-Aware Multi-Core Chip Level Multiprocessor Scheduling
Cache-aware multi-core chip level multiprocessor scheduling addresses the incapability of real-time multi-core multiprocessor scheduling algorithms to analyze memory-related issues, such as thread access patterns and memory stalls. Cache-aware scheduling can be classified into three groups: cache-fair thread scheduling, cache pattern-aware thread scheduling (placement) and adaptive cache-aware thread scheduling. Due to their relevance to our project, only cache-fair and adaptive cache-aware thread scheduling are covered in this section.
2.4.1 Cache-Fair Multi-Core CMP Scheduling
The cache-fair thread scheduling algorithms take advantage of the fact that fair sharing of resources among threads minimizes resource conflicts and bottlenecks. In the context of resource allocation, Jain et al. [Jain et al., 1998] discriminate between unfair and fair algorithms, and explain fairness in terms of quantitative formulas and metrics. Jain et al. [Jain et al., 1998] define fairness via a two-step approach. The first step is to select the appropriate metric depending on the application domain. The second step is to define the fairness index, which measures the equality of resource allocation. In fact, fairness does not necessarily correspond to equal distribution; it is completely dependent on the definition of the fairness index. In addition, Jain et al. [Jain et al., 1998] underscore the fairness index properties of scale and metric independence, size independence, boundlessness, and continuity. In this context, size independence refers to the fact that the fairness index is independent of the number of elements sharing resources; boundlessness refers to the fact that the fairness index should not be limited in value; and scale and metric independence denotes independence of metric values or scales. Following this guidance, the cache-fair scheduling algorithms deal with fairness in cache allocation.
Fedorova et al. [Fedorova et al., 2006] apply the fairness concept to the cache allocation problem on the multi-core platform to reduce the effect of unequal multi-core processor cache allocation. The architecture of multi-core processors enables multiple application threads to run concurrently on a single processor chip. Multi-core multiprocessors generally use a shared Level 2 (L2) cache memory, shared by concurrently running threads, or co-runner(s). In such an architecture, cache sharing depends only on the cache demands of the co-runner(s), so it is inevitable that unfair cache allocation takes place. A thread's cache allocation affects its cache miss rate, which has a direct impact on its cycles-per-instruction (CPI) rate, the rate at which the thread retires instructions. Hence, a thread's CPU performance is considerably dependent on its co-runners. Fedorova et al. [Fedorova et al., 2006] summarize the problems due to this co-runner-dependent performance variability as:
Unfair CPU Sharing A thread's forward progress is dependent not only on its time slice share of the CPU, but also on the cache behavior of its co-runner(s).
Poor Priority Enforcement If a high priority job is scheduled with co-runner(s) with high cache occupancy, it will experience poor performance rather than high performance.
Inadequate CPU accounting On grid-like systems where users are charged for CPU hours, the amount of computation performed per hour varies depending on the co-runners; so, such charging is not appropriate.
In order to tackle these problems, Fedorova et al. [Fedorova et al., 2006] propose to provide fairness in thread CPU latency. In other words, no matter what the cache allocation is, a thread's CPU latency will be the same as its CPU latency under the cache-fair condition. Namely, if a thread's performance degrades due to unfair cache sharing, the algorithm redistributes CPU time and gives more time to that particular thread. In fact, the additional CPU time given to the thread is taken away from other threads. Therefore, Fedorova et al. [Fedorova et al., 2006] define two classes of threads in the system: the cache-fair class and the best-effort class.
Indeed, the approach of Fedorova et al. [Fedorova et al., 2006] reduces the co-runner-dependent performance variability only for threads in the cache-fair class. Figure 2.3 illustrates CPU latency versus cache allocation for three different scenarios:
1. In the first scenario, the cache is shared unequally among threads, and unfortunately Thread A gets less cache space than the other threads. In this case, if a conventional scheduler is used to schedule the threads, then it is inevitable that Thread A's CPU latency is higher than in the equal-sharing scenario.
2. The second scenario refers to the ideal case where the cache is shared equally. In this case, Thread A's CPU latency is considered the ideal CPU latency.
3. In the third scenario, similar to the first, the cache is not shared equally and Thread A gets less cache space. However, in contrast to the first scenario, the cache-fair scheduler SHARP is used. In this case, Thread A is allocated extra CPU time, taken away from Thread B, in order to ensure that Thread A has the same CPU latency as in the second scenario, where cache allocation is fair among threads.
In the article, the fairness index is in fact the deviation of the cache miss rate from the cache-fair miss rate. The derivation of the cache-fair miss rate of thread T is based on two assumptions: 1) cache-friendly co-runners have similar cache miss rates, and 2) the relationship between co-runners' miss rates is linear [Fedorova et al., 2006]. If thread T runs with several different co-runners, then, based on the relationship between T and its co-runners, the miss rate that T would experience with cache-friendly co-runners is estimated, and this miss rate is taken as the cache-fair miss rate.
Figure 2.3: Thread CPU Latency vs Cache Allocation for Different Scenarios [Fedorova et al., 2006]
Fedorova et al. formulate this method as below:
Assuming the relationship between co-runners' miss rates is linear:
MissRate(T) = a × ∑_{i=1}^{n} MissRate(C_i) + b,    (2.1)
where T is the thread for which the fair miss rate is computed, C_i is the i-th co-runner, n is the number of co-runners, and a and b are the linear equation coefficients. Since thread T experiences its fair cache miss rate FairMissRate(T) when all concurrent threads experience the same miss rate:
FairMissRate(T) = MissRate(T) = MissRate(C_i),    (2.2)
Equation (2.1) can then be expressed as
FairMissRate(T) = a × n × FairMissRate(T) + b,
and solving for the fair miss rate gives
FairMissRate(T) = b / (1 − a × n).    (2.3)
Hence, deriving the unknowns a and b in Equation (2.1) from the known miss rates of thread T and its co-runner(s) C_i is sufficient to calculate the fair miss rate of thread T using Equation (2.3).
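In practice, Equations (2.1) and (2.3) amount to fitting a line through measured miss-rate samples. A minimal sketch of that computation (the sample format, the least-squares fit, and the single-co-runner example values are illustrative assumptions, not Fedorova et al.'s exact procedure) is:

```python
def fair_miss_rate(observations, n):
    """Estimate a thread's fair miss rate via Equations (2.1) and (2.3).

    `observations` holds (x, y) samples, where x is the sum of the n
    co-runners' miss rates and y is the thread's measured miss rate,
    gathered over runs with different co-runner sets.  A least-squares
    line y = a*x + b is fitted; Equation (2.3) then gives the fair
    miss rate b / (1 - a*n).
    """
    xs = [x for x, _ in observations]
    ys = [y for _, y in observations]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    a = (sum((x - mx) * (y - my) for x, y in observations)
         / sum((x - mx) ** 2 for x in xs))   # slope of Equation (2.1)
    b = my - a * mx                          # intercept
    return b / (1 - a * n)                   # Equation (2.3)

# Samples consistent with a = 0.5, b = 0.01 and one co-runner:
samples = [(0.02, 0.02), (0.04, 0.03), (0.06, 0.04)]
print(fair_miss_rate(samples, n=1))   # ~0.02
```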
After successfully estimating the fair miss rate of thread T, the next step is to derive the fairness metric, the thread's CPU latency. Since a thread's CPU latency is the product of its cycles per instruction (CPI) and its share of CPU time, it is possible to define the fairness metric as the CPI [Fedorova et al., 2006]. Using the CPI metric, the cache-fair scheduling algorithm proposed by Fedorova equates the fairness metric, CPI, to its fair (ideal) value, FairCPI, by adjusting the thread's share of CPU time. Fedorova et al. [Fedorova et al., 2006] provide the following analytical model to perform this operation:
CPI = CPIIdeal + L2CacheStalls, (2.4)
where CPIIdeal is the CPI when there are no L2 cache misses, L2CacheStalls is the per-
instruction stall time due to handling L2 cache misses.
The fair CPI is:
FairCPI = CPIIdeal + FairL2CacheStalls, (2.5)
where FairCPI is the CPI when the thread experiences its fair cache miss rate. In order to estimate FairCPI, the CPIIdeal and FairL2CacheStalls terms must be determined. To derive CPIIdeal, Equation (2.4) is used: the actual CPI and the statistics required to derive the thread's actual L2CacheStalls are measured using hardware counters, and then a simple subtraction gives CPIIdeal.
To compute FairL2CacheStalls, L2CacheStalls is first represented as a function of the cache miss rate, the memory stall time MemoryStalls, and the store buffer stall time StoreBufferStalls [Matick et al., 2001]:
L2CacheStalls = F (MissRate,MemoryStalls, StoreBufferStalls). (2.6)
FairL2CacheStalls is also derived using the same function, with the same MemoryStalls and StoreBufferStalls parameters, but with the FairMissRate estimated by Equation (2.3).
FairL2CacheStalls = F (FairMissRate,MemoryStalls, StoreBufferStalls). (2.7)
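Putting Equations (2.4)-(2.7) together, FairCPI can be sketched as below. The stall function F of Equation (2.6) is not reproduced in closed form in the text, so the linear per-miss stall model used here is purely an illustrative assumption:

```python
def fair_cpi(actual_cpi, miss_rate, fair_miss_rate,
             memory_stalls, store_buffer_stalls):
    """Estimate FairCPI from hardware-counter measurements.

    stall_model plays the role of F in Equations (2.6)/(2.7); here it is
    a hypothetical linear model: per-instruction L2 stall time equals
    the miss rate times the average stall cycles per miss.
    """
    def stall_model(rate):                            # assumed form of F
        return rate * (memory_stalls + store_buffer_stalls)

    l2_stalls = stall_model(miss_rate)                # actual L2CacheStalls
    cpi_ideal = actual_cpi - l2_stalls                # rearranged Eq. (2.4)
    return cpi_ideal + stall_model(fair_miss_rate)    # Eq. (2.5) with (2.7)

# A thread missing at twice its fair rate: its fair CPI is lower.
print(fair_cpi(actual_cpi=2.0, miss_rate=0.25, fair_miss_rate=0.125,
               memory_stalls=3, store_buffer_stalls=1))   # 1.5
```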
After deriving the fairness metric and index, Fedorova et al. [Fedorova et al., 2006] construct the cache-fair algorithm as a two-phase algorithm, and each cache-fair thread goes through both phases. The first phase, the so-called reconnaissance phase, is the initial phase in which the fairness metric (the fair cache miss rate) and the inputs for the fairness index (the fair CPI model) are generated. Namely, the inputs for the fair CPI model are (1) the actual CPI, (2) the actual L2 cache stalls (L2CacheStalls), and (3) the fair L2 cache stalls (FairL2CacheStalls). These inputs are obtained respectively as follows:
1. The actual CPI is computed as the ratio of the number of CPU cycles T has executed to the number of instructions T has retired.
2. The actual L2 cache stall time L2CacheStalls is estimated using the model given in Equation (2.6); the inputs, namely the cache miss rate and the memory and store buffer stall times, are retrieved from hardware counters.
3. FairL2CacheStalls is estimated using the model defined in Equation (2.7). The inputs are retrieved using hardware counters.
Once the reconnaissance phase is completed, thread T moves into the calibration phase,
where the scheduler periodically redistributes CPU time. A single calibration of thread T adjusts T's CPU quantum based on the fairness metric, the difference between the actual and fair CPI. If the calibration requires additional CPU resources, the scheduler picks a best-effort thread, some of whose CPU time is transferred to thread T. Regarding phase duration and repetition, the initial reconnaissance phase is 100 million instructions long; the thread then moves into the calibration phase, where its CPU quantum is adjusted every ten million instructions. The more frequent the adjustment, the more responsive the algorithm. Subsequent reconnaissance phases are repeated every one billion instructions; this longer period reflects the observation that L2 cache access patterns change gradually and infrequently [Fedorova et al., 2006].
Fedorova et al. [Fedorova et al., 2006] also set up test cases with different workloads and
verified the significant drop in co-runner dependent performance variability for the cache-fair
threads. The paper of Fedorova et al. has a major impact on our cache aware adaptive closed
loop scheduling framework.
Besides the contribution of Fedorova et al., Kim et al. [Kim et al., 2004] present a significant study in fair cache sharing and partitioning. This paper makes several contributions. First of all, it proposes five cache metrics that measure the degree of fairness in cache sharing, evaluates these metrics, and shows the correlations between them. Kim et al. [Kim et al., 2004] define ideal fairness as execution time fairness, which is achieved only when all co-scheduled threads have an equal slowdown Tshr/Tded:
Tshr_1/Tded_1 = Tshr_2/Tded_2 = Tshr_3/Tded_3 = · · · = Tshr_n/Tded_n,    (2.8)
which is referred to as the execution time fairness criterion. The criterion can be met if, for any pair of co-scheduled threads i and j, the following metric (M_0^{ij}) is minimized:
M_0^{ij} = |X_i − X_j|,    (2.9)
where X_i = Tshr_i/Tded_i. However, it is very difficult to measure Tshr_i during execution, since there is no reference point at which the dedicated-cache and shared-cache execution times can be compared. Hence, metrics that are easier to measure are defined, and
the ones that correlate highly with M_0^{ij} are used. Kim et al. [Kim et al., 2004] define five metrics for any co-scheduled threads i and j:
M_1^{ij} = |X_i − X_j| where X_i = Miss_shri / Miss_dedi,    (2.10)
M_2^{ij} = |X_i − X_j| where X_i = Miss_shri,    (2.11)
M_3^{ij} = |X_i − X_j| where X_i = Missr_shri / Missr_dedi,    (2.12)
M_4^{ij} = |X_i − X_j| where X_i = Missr_shri,    (2.13)
M_5^{ij} = |X_i − X_j| where X_i = Missr_shri − Missr_dedi.    (2.14)
By minimizing any of these metrics, a fair cache algorithm aims to improve fairness. When there are more than two co-scheduled threads, the fairness metric is summed over all possible pairs, M_x = ∑_i ∑_j M_x^{ij}, and this sum is minimized to improve fairness. According to Kim et al. [Kim et al., 2004], each metric has its own way of providing fairness; for instance, metric M1 tries to equalize the ratio of miss increase of each thread, while M2 tries to equalize the number of misses. Likewise, M3 seeks to equalize the ratio of miss rate increase of each thread, while M4 seeks to equalize miss rates, and M5 tries to equalize the increase in the miss rates of each thread [Kim et al., 2004]. According to how well these metrics correlate with the execution time fairness metric M0 (Equation (2.9)), one of these metrics substitutes for M0 to guide a cache fairness policy.
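The five pairwise metrics of Equations (2.10)-(2.14) are straightforward to compute once the shared and dedicated miss counts and miss rates of a thread pair are known; a direct transcription (the dict-based input format is an assumed representation) is:

```python
def fairness_metrics(thread_i, thread_j):
    """Compute Kim et al.'s pairwise fairness metrics M1..M5.

    Each thread is a dict of shared/dedicated miss counts and miss rates:
    {"miss_shr": ..., "miss_ded": ..., "missr_shr": ..., "missr_ded": ...}.
    Each metric is |X_i - X_j| for a different choice of X.
    """
    i, j = thread_i, thread_j
    return {
        # ratio of miss increase under sharing
        "M1": abs(i["miss_shr"] / i["miss_ded"] - j["miss_shr"] / j["miss_ded"]),
        # raw number of misses under sharing
        "M2": abs(i["miss_shr"] - j["miss_shr"]),
        # ratio of miss-rate increase under sharing
        "M3": abs(i["missr_shr"] / i["missr_ded"] - j["missr_shr"] / j["missr_ded"]),
        # miss rate under sharing
        "M4": abs(i["missr_shr"] - j["missr_shr"]),
        # absolute increase in miss rate under sharing
        "M5": abs((i["missr_shr"] - i["missr_ded"]) - (j["missr_shr"] - j["missr_ded"])),
    }

# Both threads double their misses under sharing, so M1 and M3 are zero:
print(fairness_metrics(
    {"miss_shr": 200, "miss_ded": 100, "missr_shr": 0.02, "missr_ded": 0.01},
    {"miss_shr": 400, "miss_ded": 200, "missr_shr": 0.04, "missr_ded": 0.02}))
```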
Secondly, using these metrics, static and dynamic L2 cache partitioning algorithms optimizing
fairness are proposed. In the static L2 cache partitioning algorithm, the stack distance profile is
used to predict the parameters of the selected metric Mx; then, among all possible partitions, the
one that minimizes Mx is applied during the co-schedule. However, stack distance profiling can
only be used with the LRU cache replacement policy, and it is obtained statically by the compiler
or by simulation; hence, the static L2 cache partitioning algorithm relies on the caches utilizing
the LRU replacement algorithm. In contrast, the dynamic L2 cache partitioning algorithm is
more elaborate, consisting of three parts: initialization, rollback and repartitioning. In the
initialization step, the cache is partitioned equally among the cores. Following this step, the
rollback step reverses any repartitioning decision that is not beneficial, by comparing miss rate
statistics before and after the decision. In the repartitioning step, the fairness is computed for
each thread, and small numbers of cache blocks, determined by the partition granularity, are
reallocated until cache fairness is achieved. These three steps are repeated periodically, but the
period length is kept long enough to observe the full effect of cache repartitioning.
2.4. CACHE-AWARE MULTI-CORE CHIP LEVEL MULTIPROCESSOR SCHEDULING 35
Regarding algorithm overhead, the dynamic L2 cache algorithm incurs profiling and storage
overhead in the form of a number of registers per thread to measure miss counts; Kim et al.
claim that this overhead is less than 0.01% compared to the smallest time unit.
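The three steps of the dynamic algorithm can be sketched as follows (a minimal sketch under our own simplifications: slowdowns and miss rate statistics are passed in as plain lists, and the repartitioning move is reduced to shifting one granularity's worth of blocks from the least-slowed to the most-slowed thread):

```python
def init_partition(total_blocks, n_threads):
    """Initialization step: divide the cache equally among the cores
    (any remainder blocks are left unassigned in this sketch)."""
    return [total_blocks // n_threads] * n_threads

def repartition(partition, slowdowns, granularity=1):
    """Repartitioning step: move one granularity's worth of blocks from
    the least-slowed thread to the most-slowed one."""
    new = list(partition)
    worst = max(range(len(slowdowns)), key=lambda i: slowdowns[i])
    best = min(range(len(slowdowns)), key=lambda i: slowdowns[i])
    if worst != best and new[best] > granularity:
        new[best] -= granularity
        new[worst] += granularity
    return new

def rollback_if_worse(prev_partition, new_partition,
                      prev_miss_rate, new_miss_rate):
    """Rollback step: undo a repartitioning decision that was not
    beneficial, judged by miss rate statistics before and after."""
    return prev_partition if new_miss_rate > prev_miss_rate else new_partition
```

In the real algorithm these steps repeat once per (long) period, so that each repartitioning decision can be judged on its full effect.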
Kim et al. [Kim et al., 2004] perform a simulation to evaluate the performance efficiency and
operability of their algorithm. According to the simulation results, metrics M1, M3, M4 and M5
produce good correlation with the execution time fairness metric M0. Regarding the throughput
and fairness relation, Kim et al. [Kim et al., 2004] conclude that optimizing fairness usually
increases throughput, while maximizing throughput does not necessarily improve fairness. This
article provides us with a deeper insight into cache-fairness metrics, stack-distance profiling,
how partitioning algorithms can be used to enforce cache-fairness, and the correlation between
cache-fairness and throughput efficiency.
Zhou et al. [Zhou et al., 2009b] address the weakness of the cache-aware algorithms in
ensuring performance fairness. This article introduces a model to examine the performance
impact of cache sharing, and proposes a mechanism of cache-sharing management to provide
performance fairness for concurrently executing applications. According to Zhou et al. [Zhou
et al., 2009b], two factors affect the correlation between cache miss fairness and performance
fairness:
1. The performance sensitivity of an application to cache misses depends on the fraction
of execution time stalled on cache misses. This fraction varies across applications; for
instance, data-centric applications spend many more cycles on memory access than
computation-oriented applications.
2. The stall caused by each miss also differs according to Memory Level Parallelism (MLP)
and Instruction Level Parallelism (ILP). For instance, the latency cycles of a miss may
overlap with those of recent misses: a data request that misses in a higher level cache is
issued directly to the lower level shared L2 cache, which may supply the data from previous
requests, reducing the cache miss penalty. Moreover, computation cycles may overlap with
memory access latency cycles; in other words, while the job is waiting for a computational
operation to complete, data can be supplied from the memory. Hence, the actual penalty of
a cache miss is reduced.
36 CHAPTER 2. LITERATURE REVIEW
Considering all these interrelations between cache misses and performance fairness, Zhou
et al. [Zhou et al., 2009b] define two fairness metrics: a performance fairness metric Mperf over
pairs of concurrently running workloads i, j under a given cache partition, and a cache miss
fairness metric Mmiss. Performance fairness and cache fairness are achieved when Mperf and
Mmiss equal zero, respectively.
M_{perf} = \sum_{i}\sum_{j}\left|\frac{CPI^{shr}_i}{CPI^{ded}_i} - \frac{CPI^{shr}_j}{CPI^{ded}_j}\right|, (2.15)

M_{miss} = \sum_{i}\sum_{j}\left|\frac{MPKI^{shr}_i}{MPKI^{ded}_i} - \frac{MPKI^{shr}_j}{MPKI^{ded}_j}\right|. (2.16)
While Mperf is the sum of the slowdown differences between every two co-scheduled
workloads, Mmiss is the sum of the differences in miss increase ratio between every two
co-scheduled workloads. The smaller Mperf and Mmiss are, the better performance fairness
and cache fairness are achieved.
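Both metrics share the same pairwise form and can be sketched directly from Equations (2.15) and (2.16) (an illustrative sketch; the function and parameter names are ours):

```python
from itertools import combinations

def m_perf(cpi_shared, cpi_dedicated):
    """Equation (2.15): sum over all workload pairs of the absolute
    difference in slowdown CPI_shr / CPI_ded; zero is perfectly fair."""
    slow = [s / d for s, d in zip(cpi_shared, cpi_dedicated)]
    return sum(abs(slow[i] - slow[j])
               for i, j in combinations(range(len(slow)), 2))

def m_miss(mpki_shared, mpki_dedicated):
    """Equation (2.16): the same pairwise form applied to the miss
    increase ratio MPKI_shr / MPKI_ded."""
    ratio = [s / d for s, d in zip(mpki_shared, mpki_dedicated)]
    return sum(abs(ratio[i] - ratio[j])
               for i, j in combinations(range(len(ratio)), 2))
```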
Considering the correlation between cache miss fairness and performance fairness discussed
in Section 2.4.1, Zhou et al. [Zhou et al., 2009b] provide an innovative model to quantify the
performance impact of cache sharing. Firstly, the execution cycles of the application are
categorized into two classes: private operation cycles (Tpri) and vulnerable operation cycles
(Tvul). Private operation cycles are execution cycles that depend only on the characteristics
of the workload; that is, no external factors such as shared cache latency or memory access
latency are involved during these cycles. These cycles are consumed by operations that need the
private resources of the core, so they depend on private resources and the related latencies, such
as computation time and instruction and data fetch latencies. Vulnerable operation cycles Tvul,
in contrast, are sensitive to external factors such as different co-scheduled threads on the other
cores, off-chip memory access latency and other off-chip latencies. However, especially in
out-of-order processors, even when an instruction is stalled by an L2 cache miss, other
instructions can still be fetched into the pipeline if there is no data dependence. Hence, during
these cycles, off-chip memory access and on-chip instructions are actually performed
simultaneously; in other words, they overlap. These overlapped cycles are called
overlap cycles (Tovl). As a result, total cycles T is defined as
T = T_{pri} + T_{vul} - T_{ovl}, (2.17)
where T_{vul} is equal to
T_{vul} = \sum_{i=1}^{N_{miss}} MissPenalty_i = \frac{N_{miss} \times MissLatency}{MLP_{avg}}. (2.18)
Then Equation (2.17) can be written as
T = T_{pri} - T_{ovl} + \frac{N_{miss} \times MissLatency}{MLP_{avg}}, (2.19)
where MLP_{avg} refers to the average number of useful outstanding off-chip requests.
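The resulting cycle model is a one-line formula (a sketch of Equation (2.19); the argument names are ours):

```python
def total_cycles(t_pri, t_ovl, n_miss, miss_latency, mlp_avg):
    """Equation (2.19): T = T_pri - T_ovl + N_miss * MissLatency / MLP_avg;
    higher memory-level parallelism shrinks the effective miss penalty."""
    return t_pri - t_ovl + n_miss * miss_latency / mlp_avg
```

For example, 1000 private cycles, 100 overlapped cycles and 50 misses of 200-cycle latency with an average MLP of 2 give 5900 total cycles instead of the naive 10900.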
Using the fairness metrics and cache performance impact model, Zhou et al. [Zhou et al.,
2009b] build a hardware framework which enforces the fairness metrics. The hardware platform
includes two interdependent parts: a cache partitioning mechanism and a hardware profiler. The
cache partitioning mechanism tags each cache block to indicate which core uses the block; for
N cores, the tag is log2 N bits long. In addition, each core i has a single-bit flag overAlloc
indicating whether it has been allocated too much cache space. This mechanism ensures that
over-allocated cores cannot gain additional cache lines, so the additional cache blocks are used
by under-allocated cores; in other words, cache fairness is enforced.
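The mechanism can be illustrated with a small sketch (our own simplification of the hardware: the owner tags and overAlloc flags are plain Python lists, and the fallback victim merely stands in for the normal replacement policy's choice):

```python
import math

def tag_bits(n_cores):
    """Each cache block carries a log2(N)-bit owner tag for N cores."""
    return int(math.ceil(math.log2(n_cores)))

def choose_victim(block_owners, requester, over_alloc):
    """On a miss by an over-allocated core, evict one of that core's own
    blocks so it cannot grow its allocation; otherwise fall back to
    index 0, standing in for the replacement policy's normal victim."""
    if over_alloc[requester]:
        for idx, owner in enumerate(block_owners):
            if owner == requester:
                return idx
    return 0
```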
The hardware profiler addresses performance fairness in a shared cache environment by
enforcing an identical slowdown, relative to a dedicated cache environment, for every workload.
In this context, the hardware profiler is responsible for profiling the necessary runtime and
statistical parameters while the applications execute concurrently, and it uses Equations (2.17)
and (2.19) to estimate how each application would perform with a dedicated cache. After
collecting these parameters, the fairness metrics Mperf and Mmiss, given in Equations (2.15)
and (2.16) respectively, are used to evaluate the fairness of the cores. Finally, the core with the
least slowdown is deemed to have been allocated too much cache space, and its overAlloc flag
is set. Although Zhou et al. provide a complete solution that guarantees both performance and
cache fairness on CMP systems, it comes at a comparably significant hardware cost: an
Auxiliary Tag Directory (ATD), which has the same associativity as the main tag directory of
the shared L2 cache, is used to
estimate some cache parameters for the hardware profiler; this effectively attaches a virtual
"private" cache to each core. In addition, an auxiliary Miss Status Holding Register (MSHR) is
added to each ATD.
Ebrahimi et al. [Ebrahimi et al., 2010] propose a new approach that enforces fairness over
the shared memory system as a whole, rather than a separate fairness mechanism for each
individual resource. In this way, the need for complex fairness mechanisms that independently
implement fairness for each resource is eliminated. Moreover, approaches implementing
separate fairness mechanisms can make contradictory decisions, leading to low fairness and
loss of performance. Ebrahimi et al. [Ebrahimi et al., 2010] develop a low-cost architectural
technique that allows system-level fairness policies, also called source-based fairness control,
to be applied to the entire memory system in CMPs. In order to achieve this, Ebrahimi et
al. [Ebrahimi et al., 2010] propose a two-step mechanism:
1. In the first step, dynamic feedback information about unfairness in the system is collected;
at this stage, system fairness is continuously monitored.
2. In the second step, this information is used to dynamically alter the rate at which the
different cores make resource requests to the shared memory subsystem, such that system-
level objectives are met.
To calculate the unfairness in the system, an execution time slowdown metric is defined as
$T_{shared}/T_{alone}$, where T_{shared} is the number of cycles an application takes when running
concurrently with other applications, and T_{alone} is the number of cycles it would have taken
running alone. However, it is challenging to obtain T_{alone} while the application is running
simultaneously with others. Hence, Ebrahimi et al. [Ebrahimi et al., 2010] define T_{excess} as
the number of extra cycles an application needs to execute due to inter-core interference in the
shared system, and T_{alone} is estimated as $T_{alone} = T_{shared} - T_{excess}$ [Ebrahimi et al., 2010].
In order to estimate T_{excess}, it is necessary to keep track of the inter-core interference each
core incurs; for this purpose, each core has an InterferencePerCore bit vector indicating whether
or not the core is delayed due to inter-core interference. When interference is detected by the
cache miss counter or the memory controller, InterferencePerCore is set and the T_{excess}
counter of the affected core is incremented.
If the unfairness is larger than the unfairness threshold set by the system software, the core
with the largest slowdown is throttled down. In order to throttle down memory requests from a
core, Miss Status Holding/Information Registers (MSHRs) can be used: increasing or decreasing
the number of MSHR entries for a core proportionally affects the rate at which that core injects
memory requests into the shared memory subsystem [Ebrahimi et al., 2010]. If continuous
fairness monitoring indicates that the system software goal has been met again, all cores are
allowed to throttle up, by controlling their MSHRs, to improve system performance.
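The estimation and throttling decision can be sketched as follows (a simplified reading: per-core slowdowns are computed from T_shared and T_excess and, as one plausible unfairness measure, the largest and smallest slowdowns are compared; the actual hardware policy in [Ebrahimi et al., 2010] may differ in detail):

```python
def estimate_t_alone(t_shared, t_excess):
    """T_alone = T_shared - T_excess, estimated while co-running."""
    return t_shared - t_excess

def throttle_decision(t_shared, t_excess, unfairness_threshold):
    """Compute each core's slowdown T_shared / T_alone; if the ratio of
    the largest to the smallest slowdown exceeds the threshold set by
    system software, return the index of the core to throttle down,
    otherwise None (all cores may throttle back up)."""
    slowdowns = [ts / estimate_t_alone(ts, te)
                 for ts, te in zip(t_shared, t_excess)]
    unfairness = max(slowdowns) / min(slowdowns)
    if unfairness > unfairness_threshold:
        return slowdowns.index(max(slowdowns))
    return None
```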
Ebrahimi et al. [Ebrahimi et al., 2010] evaluate their technique on both 2-core and 4-core
CMP systems, and the experimental results indicate system throughput improvements of 25.6%
to 14.5% and system unfairness reductions of 44.4% to 36.2%, achieved at a low cost. This
paper is significant for our research since it is one of the recent papers in the field, and it
provides a simple and efficient solution at the system level.
Jahre and Natvig [Jahre and Natvig, 2009] propose a novel fairness mechanism, called the
Dynamic Miss Handling Architecture (DMHA). The DMHA uses a single fairness enforcement
policy for the complete hardware-managed shared memory, which reduces the implementation
complexity significantly. DMHA exploits the Miss Status Holding Registers (MSHRs) available
in the private Level 1 (L1) caches, which determine the number of misses a cache can sustain
without blocking. When its MSHRs are exhausted, the cache blocks and the processor reduces
its execution speed, since it is unable to fetch more data. Jahre and Natvig use this fact to set
the execution speeds of the processors such that the slowdown due to memory system
interference is equalized across threads. In fact, Jahre and Natvig [Jahre and Natvig, 2009]
describe the whole approach, the Fair Adaptive Miss Handling Architecture (FAMHA), as
consisting of three steps: measurement, allocation and enforcement. For the measurement of
interference in a shared resource environment, Jahre and Natvig [Jahre and Natvig, 2009]
propose the notion of Interference Points (IPs), which consider three types of shared resource
interference: crossbar delay, shared cache contention and memory bus interference. From the
metrics for each interference type, total interference points are calculated. Then, a simple
fairness policy decides whether or not any reduction in execution speed is necessary. If the
fairness policy decides that processors must cut back their execution rate, the slowdown is
equalized across the processors. For this purpose, DMHA divides the total miss bandwidth
between all cores at the boundary of the private memory system, the private L1 caches. That is,
the miss bandwidth per thread is allocated by controlling the number of
available MSHRs at runtime. Hence, reducing the miss bandwidth, that is, the number of
MSHRs, slows down the core's execution speed. To deactivate an MSHR, an extra bit field,
the usable bit (U), is added to the conventional MSHR fields: the block address of the miss in
the private L1 cache, some target information and a valid bit (V). If the usable bit (U) is set,
the MSHR is deactivated and can only be used to store miss data. This reduces the number of
shared memory access requests and, as a result, forces the processor to reduce its execution
speed.
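The usable-bit mechanism can be sketched as a tiny register-file model (our own abstraction; real MSHRs also hold the miss block address, target information and a valid bit, which are omitted here):

```python
class MSHRFile:
    """Sketch of a DMHA-style miss handling register file: each entry
    carries a usable bit (U); when U is set the entry is deactivated
    and may only hold data for misses already outstanding."""

    def __init__(self, n_entries):
        self.u_bit = [False] * n_entries  # False = active entry

    def set_quota(self, n_active):
        """Allocate miss bandwidth by deactivating all but n_active
        entries; fewer active MSHRs means the cache blocks sooner and
        the core's execution speed drops."""
        for i in range(len(self.u_bit)):
            self.u_bit[i] = i >= n_active

    def can_accept_miss(self):
        """A new outstanding miss needs at least one active entry."""
        return any(not u for u in self.u_bit)
```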
Using this light-weight fairness mechanism on chip-level multiprocessor memory systems,
Jahre and Natvig improve fairness by 26% on average against the single program baseline, and
throughput by 13% on average when compared to conventional memory systems. This paper
introduces a simple and reasonable approach to the existing scheduling problem. In addition, it
provides a detailed insight into Miss Handling Architectures (MHA) 1 and MSHRs
[Ebrahimi et al., 2010].
All in all, there is a variety of approaches to the resource allocation fairness problem on
CMPs. The general strategy can be formulated in three stages: (1) measurement, (2) allocation
and (3) fairness enforcement. Algorithms differ mainly in their strategies for one of these three
sub-stages. For instance, while Jahre and Natvig use Interference Points (IPs) [Jahre and Natvig,
2009] to measure unfairness, Fedorova et al. [Fedorova et al., 2006] identify fairness based on
CPU latency or CPI; however, the general inclination is to use an execution time metric
[Ebrahimi et al., 2010, Kim et al., 2004]. Regarding fairness enforcement strategies, there are
also different approaches. For instance, Fedorova et al. [Fedorova et al., 2006] enforce fairness
by adapting the CPU cycles of a thread, whereas the general trend is to enforce fairness using
cache-related parameters such as allocation and partitioning [Ebrahimi et al., 2010, Jahre and
Natvig, 2009, Kim et al., 2004, Zhou et al., 2009b].
2.4.2 Adaptive Cache-Aware CMP Scheduling
The existing cache-aware multi-core scheduling algorithms are generally static open-loop
algorithms which cannot adapt dynamically to changing processor operating states such as
cache allocation. Hence, the efficiency of these algorithms is validated only in specific
operating environments, and their performance degrades to the extent that the
1 For more information, refer to the Kroft article [Kroft, 1981].
system state or parameters move away from the ideal operating point. As a result, there has
been a need for closed-loop algorithms to replace traditional open-loop algorithms in operating
environments where system states and parameters are subject to significant variations. In the
context of multi-core thread scheduling, shared L2 cache behavior can likewise be identified as
highly unpredictable and uncertain due to the highly correlated nature of co-runner threads.
The simulation results of Tam et al. [Tam et al., 2009] on a number of applications indicate that
cache miss rate curves may differ significantly between applications or threads, so it is very
hard to predict a thread's data reference pattern as well as its cache miss rate. Hence, the cache
access pattern itself introduces significant variation into the operating platform.
In such an unpredictable operating environment, closed-loop (feedback) control theory
ensures that the system adapts to these state changes at each control period. This is achieved by
setting a reference operating state and regulating the error between the reference and the
measured operating state towards zero. In summary, control theory provides optimality
guarantees and precise reasoning, that is, a way of quantifying, in terms of system parameters,
how far the system is from the optimization objective. Moreover, control theory assures a quick
adaptive response to changes in the dynamic behavior of system parameters, as in the case of
cache miss parameters. In spite of these advantages, there are some challenges in implementing
control theory. First of all, formulating the system behavior is not an easy task, especially for
complex systems. Secondly, the stability of the closed-loop system is a major concern, because
open-loop stability does not ensure overall closed-loop stability.
Srikantaiah et al. [Srikantaiah et al., 2009] apply formal feedback control theory to dynamic
shared last-level cache partitioning on CMPs. Feedback control theory is used to optimize
last-level cache space utilization among multiple concurrent applications with well-defined
service objectives. That is, an adaptive feedback control based cache partitioning scheme is
applied to achieve service differentiation among applications with minimum impact on the fair
speedup metric, while ensuring the fair speedup of the applications. Srikantaiah et al.
[Srikantaiah et al., 2009] underscore the necessity of QoS at the application level, which
addresses the lack of control over the management of shared resources, and offer cache
partitioning as an effective service level agreement method in shared-resource environments.
One of the important global performance objectives in a shared resource environment, the best
utilization objective, is formulated as a high
fair speedup metric for each workload:
FS(scheme) = \frac{N}{\sum_{i=1}^{N} \frac{IPC_{app_i}(base)}{IPC_{app_i}(scheme)}}, (2.20)

where N is the number of applications in the workload, that is, the set of applications running
concurrently, and the ratio $\frac{IPC_{app_i}(base)}{IPC_{app_i}(scheme)}$ reflects the improvement gained by application i
under the fair speedup improvement scheme. In this sense, FS is actually an indicator of
throughput as well as of fairness.
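Equation (2.20) is the harmonic-mean form of the per-application speedups and is straightforward to compute (a sketch; the names are ours):

```python
def fair_speedup(ipc_base, ipc_scheme):
    """Equation (2.20): FS = N / sum_i (IPC_base_i / IPC_scheme_i), the
    harmonic mean of the per-application speedups under the scheme."""
    n = len(ipc_base)
    return n / sum(b / s for b, s in zip(ipc_base, ipc_scheme))
```

Because it is a harmonic mean, a single badly slowed application drags FS down, which is why FS captures fairness as well as throughput.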
In this context, Srikantaiah et al. [Srikantaiah et al., 2009] design the SHAred Resource
Partitioning (SHARP) control architecture, which considers both fair speedup improvement and
service differentiation among applications as objectives. The SHARP control architecture
includes three main parts, as shown in Figure 2.4:
Per-Application Controller A Per-Application Controller specifies the cache space allocation
necessary to meet the specified performance targets.
Pre-Actuation Negotiator A Pre-Actuation Negotiator handles the situations where the total
cache demanded by the Per-Application Controllers exceeds the available cache capacity.
SHARP Controller A SHARP Controller adapts the reference performance targets of the
applications in a fair manner with respect to the total utilization of the cache, so that all
applications are served fairly under any cache utilization scenario.
In this architecture, the shared last-level cache is partitioned into a number of cache partitions
according to the number of applications running on the system, and the partition sizes (numbers
of cache ways) dedicated to the applications $(App_1, \cdots, App_N)$ are referred to as
$w^*_1, \cdots, w^*_N$. The total number of cache ways demanded by the Application Controllers
(AppC) should be less than or equal to the total available cache ways, $\sum_{i=1}^{N} w_i \le W$.
While an Application Controller (AppC) determines the ideal resource share as a number of
ways $(w_i)$, the Pre-Actuation Negotiator (PAN) ensures that the total requested cache ways do
not exceed the available cache ways W. If $\sum_{i=1}^{N} w_i > W$, the PAN adapts the cache
ways $(w_i)$ requested by the AppCs into feasible cache ways $w^*_i$. Based on the allocated
cache ways $w^*_i$, the i-th application is executed; and at the end of the CPU cycle, its
performance metric (IPC) $P^{out}_i$ is monitored and fed into the SHARP controller. Using the
performance metric $P^{out}_i$ and the performance reference $P^{ref}_i$, the SHARP
Figure 2.4: SHARP Control Architecture [Srikantaiah et al., 2009]
controller sets a new performance target $P^*_i$ for the Application Controller $(AppC_i)$.
The Application Controller (AppC) is the main part of the control framework; it determines
the number of cache ways $(w_i)$ required to achieve the target performance $P^*_i$ using
additional information: the performance achieved in the previous time interval, $P^{out}_i(t-1)$,
and the partition size $w_i(t-1)$. Srikantaiah et al. [Srikantaiah et al., 2009] propose a
customized AppC controller based on a Reinforced Oscillation Resistant (ROR) controller 2,
which tracks the history of cache decisions in order to alleviate the variation in cache space
allocations caused by traditional PID controllers [Srikantaiah et al., 2009]. In addition, relevant
cache allocation decisions require an efficient cache performance model. Hence, Srikantaiah et
al. [Srikantaiah et al., 2009] formulate the control law and the cache performance model as:
\Psi_i(t) = \frac{-\log\left(1 - \frac{P^*_i(t)}{\varphi_i(\infty)}\right)}{\alpha_i}, (Cache Performance Model) (2.21)

\forall i,\; w_i(t) = w_i(t-1) + \Psi_i(t) - m_i(t-1), (Control Law for the AppC (ROR) controller) (2.22)
where $\varphi_i(\infty)$ is the instructions per cycle (IPC) of application i when it is allocated the maximum
2 The ROR controller uses a similar design to a predictive controller; please refer to the Camacho and Bordons book on MPC [Camacho and Bordons, 1999].
number of cache ways (theoretically, $w_i = \infty$); $\Psi_i(t)$ is the predicted number of cache
ways application i needs in order to reach the target performance $P^*_i(t)$; the moving average
$m_i(t-1)$ summarizes the previous values of $w_i$ determined with the same control law; and
$\alpha_i$ is a parameter that approximately determines the utility of each additional cache way
that might be allocated to application i. Here, $\alpha_i$ and $\varphi_i(\infty)$ are cache model
parameters, which are learned online and updated with continuous allocations according to the
application's data access behavior.
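Equations (2.21) and (2.22) can be sketched directly (our reading: the model inverts an exponential IPC-versus-ways curve $IPC(w) = \varphi_\infty(1 - e^{-\alpha w})$ and takes log as the natural logarithm; both assumptions are ours, not stated explicitly in [Srikantaiah et al., 2009]):

```python
import math

def predicted_ways(p_target, phi_inf, alpha):
    """Cache performance model (2.21): Psi = -ln(1 - P*/phi_inf) / alpha,
    i.e. the ways needed to reach the target IPC if the IPC-versus-ways
    curve is IPC(w) = phi_inf * (1 - exp(-alpha * w))."""
    return -math.log(1.0 - p_target / phi_inf) / alpha

def next_ways(w_prev, psi, m_prev):
    """ROR control law (2.22): w(t) = w(t-1) + Psi(t) - m(t-1), where m
    is a moving average of past allocations that damps oscillation."""
    return w_prev + psi - m_prev
```

Inverting the model at the IPC produced by 4 ways returns 4 ways again, which is the self-consistency the controller relies on.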
The Pre-Actuation Negotiator (PAN) is used to obtain a feasible partition
$\{w^*_1, w^*_2, w^*_3, \cdots, w^*_N\}$ of the W-way partitioned shared cache; the PAN derives a
feasible partition from the demands $\{w_1, w_2, w_3, \cdots, w_N\}$ of the AppC controllers
whenever $\sum_{i=1}^{N} w_i > W$. The PAN can also implement different policies in this
process to enforce fairness and service differentiation. Srikantaiah et al. [Srikantaiah et al.,
2009] apply two sample policies in their paper:
Fair Speedup Improvement (FSI) This policy keeps the fair speedup metric of the workload
as high as possible. For that purpose, the total number of excess ways to recover from
the workload is determined in order to specify a feasible partition, $spill_w = \sum_{i=1}^{N} w_i - W$.
Then, for each application, a feasible number of ways $w^*_i$ is defined proportionally to
the demanded number of ways $w_i$:

w^*_i = \left\lfloor w_i\left(1 - \frac{spill_w}{\sum_{i=1}^{N} w_i}\right)\right\rfloor, (2.23)
Service Differentiation (SD) The service differentiation policy (SD) provides service
differentiation among applications when the cache space demand exceeds the available
cache space, in such a way that higher priority applications are favored during the
recovery of the $spill_w$ cache ways.
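The FSI policy of Equation (2.23) amounts to a proportional shrink of every demand (a sketch; the names are ours):

```python
import math

def fsi_partition(demanded, total_ways):
    """FSI policy (Equation 2.23): when the total demand exceeds W,
    shrink every application's demanded ways proportionally and floor
    the result; otherwise the demands are already feasible."""
    total_demand = sum(demanded)
    if total_demand <= total_ways:
        return list(demanded)
    spill = total_demand - total_ways
    return [math.floor(w * (1 - spill / total_demand)) for w in demanded]
```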
The SHARP controller aims to increase the utilization of the cache when the cache ways
demanded to meet the target IPCs are fewer than the total available cache ways. Therefore, the
reference IPCs $\{P^*_1, P^*_2, P^*_3, \cdots, P^*_N\}$ for the AppC controllers are recomputed in a
fair way if FSI is used in the PAN, such that the additional $W - \sum_{i=1}^{N} w_i$ cache ways
are distributed fairly as:

P^*_i(t) = P^{ref}_i \frac{W}{\sum_{j=1}^{N}\left(\frac{w^*_j(t-1)}{P^{out}_j(t-1)} P^{ref}_j\right)}, (2.24)
If the SD policy is used in the PAN, then the weight factors $\{\Theta_1, \Theta_2, \cdots, \Theta_N\}$
are also considered during the distribution of cache ways:

P^*_i(t) = P^{ref}_i \frac{W \sum_{j=1}^{N}\Theta_j}{\sum_{j=1}^{N}\left(\frac{w^*_j(t-1)}{P^{out}_j(t-1)} P^{ref}_j \Theta_j\right)}, (2.25)
In summary, the SHARP control architecture guarantees optimal cache space utilization, fair
speedup improvement, and service differentiation among concurrently running applications in a
shared last-level cache environment on CMPs. According to Srikantaiah et al., the fair speedup
scheme achieves a 21.9% improvement on the fair speedup metric and provides well regulated
service differentiation on 2-core and 8-core systems. This paper is significant for our research
work, especially from an architectural and cache performance modeling point of view.
There is another set of articles on feedback control theory, from Brinkschulte and Pacher
[Brinkschulte and Pacher, 2008] and Velusamy et al. [Velusamy et al., 2002]. Brinkschulte
and Pacher [Brinkschulte and Pacher, 2008] propose a closed feedback loop to control the
throughput (IPC rate) of a thread running on simultaneous multi-threaded microprocessors.
Thread synchronization and branch misprediction are considered the factors that cause the
throughput (IPC) of threads to fall behind the ideal aimed throughput. In such highly
unpredictable environments, Brinkschulte and Pacher propose a closed-loop feedback
framework considering only branch misprediction as a slowdown factor for the throughput of a
thread. First of all, a mathematical model is developed to define the relation between IPC and
branch misprediction:
IPC(n) = \frac{GP(n)}{1 + MPR(n) \cdot MPP}, (2.26)

where GP(n) is the thread's Guaranteed Percentage rate in interval n, MPR(n) is the
misprediction rate in interval n, and MPP is the number of penalty clock cycles. Then, based on
the mathematical model in Equation (2.26), a new thread priority is defined for the thread in
each control cycle. That is, the thread's actual IPC rate is monitored periodically, and the
difference between the aimed IPC rate and the measured value from the previous interval is fed
into a P controller in order to determine a new priority (the Guaranteed Percentage value, GP)
for the thread:

GP(n) = P \cdot (IPC_{Aimed} - IPC(n-1)), (2.27)
where P is the proportional factor used by the controller to derive a new GP value. Then,
Brinkschulte and Pacher analyze and discuss the convergence, settling time and steady-state
value of the controlled IPC rate. This paper provides us with a perspective on how to apply
control theory on a simultaneous multithreading (SMT) multiprocessor platform.
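The control loop of Equations (2.26) and (2.27) is simple enough to sketch directly (the function and argument names are ours):

```python
def ipc(gp, mpr, mpp):
    """Equation (2.26): IPC(n) = GP(n) / (1 + MPR(n) * MPP); the
    achieved IPC shrinks as the misprediction penalty grows."""
    return gp / (1.0 + mpr * mpp)

def next_gp(ipc_aimed, ipc_measured, p_gain):
    """Equation (2.27): the P controller derives the next Guaranteed
    Percentage value from the current IPC error."""
    return p_gain * (ipc_aimed - ipc_measured)
```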
Velusamy et al. [Velusamy et al., 2002] provide a general guideline on applying formal
controller-design techniques to track a desired adaptive behavior. This covers modeling the
underlying system behavior, choosing a setpoint (target metric) and a sampling rate, as well as
implementing controllers in hardware. Velusamy et al. demonstrate this guideline with a
concrete example, for which the cache decay interval is chosen. Velusamy et al. [Velusamy et
al., 2002] define cache decay as a technique for leakage-energy savings that waits for some
pre-determined time, the decay interval, before concluding that a cache line's data is no longer
in use and that the line can be deactivated [Velusamy et al., 2002] 3. Most importantly, the
paper of Velusamy et al. is a good roadmap for applying control theory to a computer
architecture. The paper includes a basic-level discussion on mapping adaptive techniques onto
control loops, covering control loop parameters such as the setpoint, controller parameters and
control input, and simple digital control theory concepts such as the proportional controller
(P controller), the integral controller (I controller), controller gain selection and stability
analysis. In addition, Velusamy et al. also enlighten the reader on establishing dynamic models,
selecting sampling rates and building practical systems. In our research, this paper will be a
good roadmap for mapping our control strategy onto our scheduling strategy and scheduling
framework.
In summary, to the best of our knowledge, the control theory approach is not a common
strategy for shared computing resource scheduling problems, and even the works discussed so
far apply only basic control tools to the cache-aware chip-level multiprocessor scheduling
domain. These tools provide significant efficiency when the system dynamics and operating
conditions vary only slightly around an operating point. However, in a highly unpredictable
operating environment where the system operating conditions, states and dynamics vary
significantly, simple feedback control algorithms fail. Modern control theories, in contrast,
remain applicable to such scenarios and can handle the unpredictability of the system and of
the operating environment.
3 Since leakage-energy saving in multiprocessor systems is outside our research scope, further detailed discussion is skipped; please refer to Velusamy et al. for a more detailed discussion [Velusamy et al., 2002].
2.5. MODERN CONTROL THEORY FOR SCHEDULING PROBLEMS 47
2.5 Modern Control Theory for Scheduling Problems
In parallel with increasing system complexity, control system theory develops new
algorithms, methods and even theories to tackle system challenges; as a result, modern control
systems have emerged. Paraskevopoulos [Paraskevopoulos, 2002] explains the main differences
between traditional and modern (advanced) control systems:
1. Classical control refers to single input single output (SISO) systems; hence its design
methods prefer graphical techniques such as Bode, Nyquist and root locus plots rather
than advanced mathematical design approaches;
2. Modern control refers to complex multiple input multiple output (MIMO) systems, and
its design methods are analytical methods which require advanced mathematics.
Modern control theory, which is based on time-domain analysis capabilities and state variables,
has been developed in order to handle the complexity of modern plants and to meet strict
requirements on accuracy and cost [Ogata, 1997]. From 1960 to 1980, research was conducted
on optimal control for both deterministic and stochastic systems and on adaptive control of
complex systems. Optimal control theory refers to the mathematical optimization technique for
deriving control policies, whereas adaptive control refers to the modern control scheme capable
of handling time-varying system behaviors. According to Ioannou and Fidan [Ioannou and
Fidan, 2006], adaptive control consists of a parameter estimator, which estimates the plant
parameters online, and a control law to control sets of plants whose parameters are completely
unknown and/or behave unpredictably. On the basis of the Ioannou and Fidan definition, the
adaptive control scheme is considered the main framework of our thesis project. In this context,
the rest of the literature review focuses on adaptive control.
2.5.1 Adaptive Control
The history of adaptive control systems goes back to the early 1950s, when adaptive control
theory was proposed in order to control systems with unknown or slowly time-varying
characteristics. The first application domain was the avionics industry, in the design of
autopilots for high-performance aircraft. Such aircraft have a wide range of operating speeds
and environments, and conventional feedback controllers could operate well in one operating
condition, but not over the whole flight. Hence, gain scheduling was the first proposed adaptive
control scheme, believed to be a suitable technique for flight control systems. In the 1960s,
developments in state-space, stability and stochastic theory led to a better understanding of
adaptive systems; building on this, Tsypkin proposed a common framework uniting system
identification (learning) with adaptive control. Following this, the major developments in
system identification and estimation schemes, as well as in different design methods, occurred
in the 1970s. In the late 1970s and early 1980s, proofs of the stability of adaptive systems led
to the idea of merging robust control and system identification, which in turn led to robust
adaptive control [Astrom and Wittenmark, 1994].
As explained above, adaptive control theory was developed to address the inefficiency of con-
ventional feedback systems in controlling time-varying or unknown systems. That is, sys-
tems whose plant is unknown, or whose plant parameters are subject to unpredictable
slow variations, are the application domain of adaptive control. For these systems, adaptive control
provides an adjustment mechanism to re-design the controller to meet the control objectives and
specifications, and this process is repeated on each time cycle. In this way, continuous
adaptation of the controller to the varying system characteristics is achieved. In such
control systems, it is necessary to have two separate loops, as in Figure 2.5: (1) a normal feedback
loop with the process and the controller, and (2) a parameter adjustment loop including online
learning and controller adjustment [Astrom and Wittenmark, 1994].

Figure 2.5: Block Diagram of an Adaptive System [Astrom and Wittenmark, 1994]

So far, the system plant parameter
variation has been presented as the decision threshold for whether or not an adaptive control
system is required. In fact, linear feedback systems also have the ability to cope
with parameter changes to some extent. However, without a bound or limit on the variation, the
stability of the linear feedback system becomes a critical issue. Ioannou and Fidan [Ioannou
and Fidan, 2006] explain the case where an adaptive control system is superior to linear control as follows:
Consider the scalar plant

ẋ = αx + u, (2.28)

where u is the control input, x is the state of the plant, and α is unknown. In such a system, the aim
is to choose u such that the state x remains bounded and stable over time. If α were a known parameter,
then the linear control law

u = −kx, k > |α|, (2.29)

would meet the control objective.
In fact, for a known upper bound ᾱ ≥ |α|, the above control law can also meet the control objective
with k > ᾱ. However, if α drifts into the range α > k > 0, then the closed loop ẋ = (α − k)x becomes
unstable. In conclusion, in the absence of an upper bound for the plant parameter, there is no
fixed linear controller that stabilizes the plant [Ioannou and Fidan, 2006]. As mentioned previously,
adaptive control systems employ two different loops, the adjustment loop and the traditional
feedback loop. Hence, it is necessary for the adaptive system to observe the system operating
parameters and, in some cases, to perform online learning or estimation of the plant parameters. This
requires enough data to obtain stable information about these parameters.
Afterwards, based on the online learning or observation, the controller parameters are adjusted. In
this regard, it is clear that the adaptive system's response to a changing operating parameter
is not instantaneous and takes a few cycles to complete. Hence, an adaptive control system
is able to track only slowly changing operating conditions or plant parameters.
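The scalar example above can be illustrated numerically. The sketch below (plain Python) compares a fixed-gain controller u = −kx with a simple gradient adaptation law k̇ = γx², a standard adaptive law from the literature; the adaptation gain γ, step size, and horizon are illustrative choices for the simulation, not values from the thesis.

```python
def simulate(alpha, adaptive, k0=1.0, gamma=5.0, dt=0.001, steps=20000):
    """Euler simulation of the scalar plant xdot = alpha*x + u with u = -k*x.

    Fixed gain keeps k = k0; the adaptive law grows k via kdot = gamma*x**2
    until the closed loop xdot = (alpha - k)x becomes stable.
    """
    x, k = 1.0, k0
    for _ in range(steps):
        x += dt * (alpha - k) * x          # closed-loop dynamics
        if adaptive:
            k += dt * gamma * x * x        # gradient adaptation law
        if abs(x) > 1e6:                   # treat blow-up as instability
            return float("inf")
    return abs(x)

# alpha = 2 exceeds the fixed gain k0 = 1, so the fixed controller diverges,
# while the adaptive gain rises above alpha and regulates x toward zero.
fixed_result = simulate(2.0, adaptive=False)
adaptive_result = simulate(2.0, adaptive=True)
```

Under these illustrative numbers the fixed-gain run diverges while the adaptive run drives |x| to essentially zero, matching the Ioannou and Fidan argument.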
As discussed previously, adaptive control systems are applicable to operating environ-
ments or plants with unbounded parameter variations and/or unknown parameters. In such cases,
the variations in real systems might arise from the following sources [Astrom and Wittenmark, 1994]:
Nonlinear Actuators Actuators, such as valves, are a very common source of variation and
have nonlinear characteristics.
Flow and Speed Variations Pipe and tank systems can be another source of variation in
the sense that flows are generated by the source, so the source production rate has a
significant impact on the flow. That is, the flow dynamics change as the produc-
tion rate (operating point) changes; hence, a controller should also be adapted to such
changes to achieve better performance.
Flight Control The flight dynamics of airplanes change significantly with speed, altitude and
so on. Specifically, supersonic flight control systems such as autopilots and stability aug-
mentation systems created a significant challenge for conventional linear feedback systems,
and this was the driving force in the development of adaptive systems.
Variations in Disturbance Characteristics Variation in disturbance characteristics can have
as significant an impact on controller performance as variation in the system dy-
namics. For instance, the design of an autopilot (control system) for ship steering must
consider the disturbance forces acting on the ship, such as wind, waves and current; and
these disturbance factors can vary significantly due to unpredictable changes in
weather conditions. Hence, it is mandatory to adapt the controller parameters to cope with
such significant changes in the disturbance characteristics of the system. The same applies
to some flight control systems.
According to their online estimation and parameter uncertainty compensation capabilities, adap-
tive control schemes can be categorized into three subcategories: identifier-based adaptive
schemes, which have online estimation capability but cannot compensate for parameter uncertain-
ties; non-identifier based schemes, which support neither online estimation nor uncertainty
compensation; and dual adaptive control, which facilitates both online estimation (as hyper-
state estimation) and uncertainty compensation.
Identifier-Based Adaptive Control Schemes
This class of adaptive control schemes combines the online estimator, which estimates the
unknown parameters of the plant at each instant of time, with the conventional control law. In
this class of schemes, the system plant is identified using online estimators at each control loop,
and based on the estimated plant model, the controller is designed using conventional controller
design methods for traditional feedback loops. The online estimation or system identification
process varies depending on the plant characteristics:
Nonparametric Adaptive Control This scheme is based on stochastic estimation algo-
rithms, since the plant has a stochastic characteristic rather than a determin-
istic one. In this case, the stochastic behavior of the plant is estimated.
Parametric Adaptive Control This class of scheme is based on deterministic estimation
algorithms, since the plant can be represented by parameters; in that case these pa-
rameters are estimated. This class of scheme can be approached in two ways:
1. Indirect (Explicit) Adaptive Control
2. Direct (Implicit) Adaptive Control
Further discussion of the parametric adaptive control schemes is provided in Chapter 3,
System Model Adaptive Control.
Non-Identifier Based Adaptive Control Schemes
The non-identifier based class of adaptive control schemes, in contrast to the identifier-based ones,
does not include online estimators. Instead, a search method is utilized in order to find appropri-
ate control parameters from the range of possible parameters; alternatively, switching between
different fixed-parameter controllers can be employed under the assumption that at least one of them is
stabilizing; or multiple fixed models of the plant can be used.
Dual Adaptive Control Systems
The adaptive schemes described so far do not consider parameter and plant un-
certainties. In this regard, dual adaptive control systems provide a solution for uncertain
adaptive systems. The dual adaptive control scheme derives a solution from an abstract formulation
of the control problem, as in the previous schemes, and uses optimization theory, specifically
nonlinear stochastic control theory, for the optimal solution. The major contribution and advan-
tage of dual adaptive schemes is that the uncertainties in the parameters are taken into account
by the controller; however, it is a complicated approach for practical problems.
2.5.2 Parameter Estimation
Accurate online estimation of the process parameters is significant in adaptive control
theory, particularly in identifier-based adaptive control schemes and dual adaptive control
systems. In this sense, parameter estimation in the general context is the key element in our
identifier-based adaptive scheduling scheme. According to Ljung [Ljung, 1998], parameter
estimation is a search, within a set of parametric models, for the parametric model that best fits
given experimental data according to a given criterion. Parameter
estimation methods can be classified according to the source of statistical information on which
model construction is based, namely prior knowledge and experiment data [Bohlin, 2006]:
White-Box Estimation White-box parameter estimation uses invariant prior knowledge of
the system and its environment. The shortcoming of this identification approach is its
inability to handle uncertainties and unpredictable changes in the system.
Black-Box Estimation Black-box estimation methods use statistical methods to pro-
cess experiment data and produce a data description of the system, which leads to the
selection of a suitable model. In this approach, the system uncertainties and changes
are explicitly considered as part of the usual process; however, since the model is
based entirely on the experiment data, the main concern is inconsistency in the data
itself, which might lead to very different models across the iterative experiments.
Grey-Box Estimation Grey-box estimation methods use both prior knowledge and experiment
data to construct the most suitable model.
Regardless of this classification, Astrom and Wittenmark [Astrom and Wittenmark, 1994]
underline the key elements of parameter estimation to be heeded from the designer's point of
view:
Selection of Model Structure Selection of model structure involves selecting a class
of models within which the most suitable model is chosen. Depending on the system
complexity, linearity or nonlinearity, time characteristics such as time variance and
time invariance, and the mathematical representation of the system (e.g., polynomial or
state-space representation), different model structures can be constructed. In our research
project, a linear time-invariant discrete state-space model is designed for the thread cache
behaviors.
Experiment Design Experiment design is the development of an experiment setup
in which the experiment variables are defined, including the experiment metrics (which
signals to measure), the observation retrieval period (when to measure these signals) and
the input signals (which input signals to apply to the system). The aim of experiment
design is to determine the experiment variables so as to retrieve maximal statistical
information. Two significant experiment types underlie experiment design in
parameter estimation: online experiments and offline estimation. In our experiment,
the online experiment is selected because our parameter estimation method is required
to be functional while the system is still running.
Estimation Phase The estimation phase is in fact the selection of the best model from a predefined
model set. In other words, the best mapping of the input data vector into the output data vector
with minimal error is sought, and the set of parameters which maps each component
of the input to the output is called the parameter vector θ. In our research, the least-squares
family of estimation algorithms is used as the estimation tool.
Validation The validation stage can be considered as the mechanism which validates the
selection of the best model out of the model set.
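The estimation phase above can be illustrated with a minimal batch least-squares sketch. The first-order input-output model, its parameter values, and the data-generation setup below are hypothetical choices for illustration, not the thesis model; the thesis uses the least-squares family on the cache model of Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-order input-output model: y[k] = t1*y[k-1] + t2*u[k-1].
theta_true = np.array([0.8, 0.5])
u = rng.standard_normal(200)
y = np.zeros(201)
for k in range(1, 201):
    y[k] = theta_true[0] * y[k - 1] + theta_true[1] * u[k - 1]

# Stack the regressors into Phi and solve the least-squares problem
# min || y - Phi @ theta ||^2 for the parameter vector theta.
Phi = np.column_stack([y[:-1], u])
theta_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
```

With noise-free data the estimate recovers the generating parameters; validation would then compare the fitted model's predictions against held-out data.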
In summary, this section has provided an abstract introduction to parameter estimation in the
context of the adaptive control scheme. The key design steps in parameter estimation proce-
dures, including selection of models, experiment design, parameter estimation and validation,
were briefly introduced. Throughout this section, questions related to the adaptive control
context and the possible application domain were addressed to ensure the coherence of parameter
estimation with adaptive control theory.
2.5.3 Control System Design Algorithms
In the context of adaptive control, control system design refers to the mathematical approach of
designing a system with a specified behavior for given design parameters. As mentioned previ-
ously, adaptive control theory is based on the underlying assumption that the process dynamics
and environment change continually. In this case, it is not appropriate to consider a static
controller. That is, each time the recursive parameter estimation tools estimate the system parameters,
it is necessary to re-design the controller based on these estimates. Hence,
recursive controller design algorithms are needed, which are capable of determining the
controller parameters for given system parameters.
The control system design approaches in modern control theory can be categorized
under three main subcategories:
• Frequency Domain Control System Design Algorithms
• State-Space Control System Design Algorithms
• Robust and Optimal Control System Design Algorithms
In our adaptive scheduling framework, the state-space control system design algorithms are
used because of their compatibility with the adaptive control framework. State-space control design
methods can be divided into two different classes of methods [Paraskevopoulos, 2002]:
Algebraic Control Design Methods The algebraic method belongs to a particular category of mod-
ern control design problems where the controller has a pre-specified structure. In this case,
it is only required to determine the controller parameters for which certain closed-loop
performance requirements are met. These algebraic system design methods address many
control problems such as dead-beat, pole placement, input-output decoupling and exact
model matching design. In the context of adaptive control, the algebraic methods ad-
dressing pole placement problems are named adaptive pole placement methods [Astrom
and Wittenmark, 1994].
State Observer Control Design Methods State observer methods are used to generate a good
estimate x̂(t) of the state vector x(t) based on the mathematical model of the system;
hence, in this approach it is essential to have an exact mathematical model of the system.
Using these estimates x̂(t) and state feedback techniques, it is possible to solve many
control problems such as pole assignment, input-output decoupling and model matching.
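The state observer idea can be sketched with a minimal discrete-time Luenberger observer. The plant matrices and the observer gain L below are illustrative values (not from the thesis), with L chosen so that A − LC has stable eigenvalues and the estimation error decays.

```python
import numpy as np

# Hypothetical discrete-time plant; A, B, C and the observer gain L are
# illustrative, with L chosen so that A - L @ C is a stable matrix.
A = np.array([[1.0, 0.1],
              [0.0, 0.9]])
B = np.array([[0.0],
              [0.1]])
C = np.array([[1.0, 0.0]])
L = np.array([[0.9],
              [0.5]])

x = np.array([[1.0], [-1.0]])   # true state (not directly measurable)
x_hat = np.zeros((2, 1))        # observer estimate, started with no knowledge
for k in range(200):
    u = np.array([[np.sin(0.05 * k)]])
    y = C @ x                                        # measured output
    # Luenberger update: model prediction corrected by the output error
    x_hat = A @ x_hat + B @ u + L @ (y - C @ x_hat)
    x = A @ x + B @ u
estimation_error = np.linalg.norm(x - x_hat)
```

The error obeys e(k+1) = (A − LC)e(k), so with a stable A − LC the estimate converges to the true state regardless of the observer's initial guess.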
In the context of adaptive control, there is no exact mathematical system model; instead,
the parameters are estimated on each cycle based on the input-output relation of the system. In
this case, the algebraic control design method is much more convenient for our adaptive control
scheme. As mentioned previously, the rest of the discussion refers to the algebraic design methods
as adaptive pole placement algorithms. The algebraic control design methods are discussed in
detail in Chapter 5, Algebraic Controller Design Methods.
In summary, as mentioned previously, different design methodologies can be utilized in
control system design; however, state-space algebraic design methods, specifically adaptive
pole placement methods, are very common in adaptive control schemes. Therefore, our
identifier-based indirect adaptive control scheme uses the adaptive pole placement method to
design the controller. As future work, this control system design could be improved
with more advanced control design approaches, such as H∞ robust control design methods, to
increase the robustness of the overall adaptive control scheme.
Chapter 3
System Model Adaptive Control
In this chapter, the mathematical dynamic resource pattern model of a thread in a shared cache
and processor cycle environment is developed. This model, in fact, formulates the cache-aware
resource allocation problem in terms of the constraints cache miss count, instruction count and
CPU cycles. Ultimately, this formulation constitutes the basis of our cache aware adaptive
closed loop scheduling solution.
3.1 Theoretical Background of Dynamic System Model
In our thesis, the thread execution and cache access patterns of threads, particularly shared Level
2 (L2) cache behavior and the corresponding slowdown in processor performance, are modeled. As
might be expected, due to the number of parameters involved and the order of the system, it is
necessary to adopt a mathematical approach which is capable of analyzing high-order systems with
many parameters and which is convenient to implement in a computing environment.
As a result, state-space theories and tools are preferred. The rest of the chapter focuses on the
state-space formulation of these systems and the relevant adaptive strategies on these state-space
elements.
3.1.1 State-Space Theory
In the context of state-space theory, the state-space modeling and analysis of control
systems are discussed. Before proceeding further, some preliminary definitions of state-
space theory are as follows:
State The state of a dynamic system is the smallest set of variables (called state variables)
whose knowledge at t = t0, together with knowledge of the input for t ≥ t0, completely
determines the behavior of the system for any time t ≥ t0.
State Variables The state variables of a dynamic system are the variables forming the smallest
set of variables that determine the state of the dynamic system. For instance, if at least
n variables x1, x2, · · · , xn are required to describe the behavior of a dynamic system,
then those n variables are a set of state variables. Most importantly, state-space theory
provides significant freedom in selecting state variables: state variables need not
be physically measurable or observable quantities. However, in practice it
is convenient to choose measurable quantities where possible.
State Vector If n state variables are required to completely describe the behavior of
a given system, these n state variables are the n components of a vector x, which is called the
state vector. In other words, a state vector is a vector that determines the system state x(t)
for any time t ≥ t0, once the state at t = t0 is given and the input u(t) for t ≥ t0 is
defined.
State Space The n-dimensional space ℝⁿ, which is formed by the n independent axes
x1, x2, x3, · · · , xn, is called the state space; any state can be represented by a point in the
state space.
State-Space Equations In state-space analysis, three types of variables are involved in modeling
dynamic systems: input variables, output variables, and state variables. According
to the definition of a state vector, a state vector is able to describe the system state for any time
t ≥ t0 if and only if the state at t = t0 along with the input u(t) for t ≥ t0 is provided.
For this to hold, the physical system must involve elements that memorize this information. For
instance, in continuous-time control systems the integrators serve as memory devices,
and the outputs of the integrators are considered as state variables. Hence, the number of
integrators equals the number of state variables that define the control system.
Assume a MIMO system involving n integrators has r inputs u1(t), u2(t), · · · , ur(t) and m
outputs y1(t), y2(t), · · · , ym(t); and the n outputs of the integrators are defined as state variables
x1(t), x2(t), · · · , xn(t). Then the system can be described by the following equations [Ogata,
1997]:
ẋ1(t) = f1(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
ẋ2(t) = f2(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
...
ẋn(t) = fn(x1, x2, · · · , xn; u1, u2, · · · , ur; t). (3.1)
The outputs y1(t), y2(t), · · · , ym(t) of the system are defined by
y1(t) = g1(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
y2(t) = g2(x1, x2, · · · , xn; u1, u2, · · · , ur; t),
...
ym(t) = gm(x1, x2, · · · , xn; u1, u2, · · · , ur; t). (3.2)
These equations can be transformed into vector form by defining

x(t) = [x1(t), x2(t), · · · , xn(t)]ᵀ,
f(x, u, t) = [f1(x1, · · · , xn; u1, · · · , ur; t), · · · , fn(x1, · · · , xn; u1, · · · , ur; t)]ᵀ, (3.3)

y(t) = [y1(t), y2(t), · · · , ym(t)]ᵀ,
g(x, u, t) = [g1(x1, · · · , xn; u1, · · · , ur; t), · · · , gm(x1, · · · , xn; u1, · · · , ur; t)]ᵀ,
u(t) = [u1(t), u2(t), · · · , ur(t)]ᵀ. (3.4)
Then equations (3.1) and (3.2) can be written compactly as:

ẋ(t) = f(x, u, t), (3.5)
y(t) = g(x, u, t). (3.6)
If the vector functions f and/or g change with respect to time, then the system is called
a time-varying system.
If Equations (3.5) and (3.6) are linearized around an operating state, then the linearized state
and output equations are as follows:

ẋ(t) = A(t)x(t) + B(t)u(t), (3.7)
y(t) = C(t)x(t) + D(t)u(t). (3.8)

If the vector functions f and g do not explicitly involve the time parameter t, then the system is called a time-
invariant system. In this case, equations (3.7) and (3.8) can be simplified to:

ẋ(t) = Ax(t) + Bu(t), (3.9)
y(t) = Cx(t) + Du(t). (3.10)

These are the general forms of the state and output equations: for general time-
varying systems (3.5), (3.6), for linear time-varying systems (3.7), (3.8), and for linear time-
invariant systems (3.9), (3.10).
Transformation between the state-space representation in the time domain and the transfer function
representation in the Laplace/frequency domain is another commonly used derivation during
control design; hence, the following formulation indicates how to derive a MIMO transfer
function from given state-space equations. Consider a MIMO LTI system with the following
state-space representation:

Ẋ(t) = AX(t) + BU(t),
Y(t) = CX(t) + DU(t). (3.11)
Take the Laplace transform of the state-space equations:

L[Ẋ(t)] = L[AX(t)] + L[BU(t)],
L[Y(t)] = L[CX(t)] + L[DU(t)]. (3.12)

The Laplace transform gives the following result:

sX(s) − X(0) = AX(s) + BU(s),
Y(s) = CX(s) + DU(s). (3.13)

Separate out the variables in the state equation as follows:

sX(s) − AX(s) = X(0) + BU(s). (3.14)

Factor out X(s):

X(s) = [sI − A]⁻¹X(0) + [sI − A]⁻¹BU(s). (3.15)

Derive Y(s) using X(s), assuming zero initial conditions X(0) = 0:

Y(s) = (C[sI − A]⁻¹B + D)U(s),
H(s) = C[sI − A]⁻¹B + D. (3.16)
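Equation (3.16) can be evaluated numerically for a given system. The sketch below (numpy, with an illustrative single-state system, not one from the thesis) computes H(s) = C(sI − A)⁻¹B + D at a chosen complex frequency:

```python
import numpy as np

def transfer_function(A, B, C, D, s):
    """Evaluate H(s) = C (sI - A)^(-1) B + D at one complex frequency s."""
    n = A.shape[0]
    return C @ np.linalg.solve(s * np.eye(n) - A, B) + D

# Illustrative single-state system: xdot = -2x + u, y = x, so H(s) = 1/(s + 2).
A = np.array([[-2.0]])
B = np.array([[1.0]])
C = np.array([[1.0]])
D = np.array([[0.0]])

dc_gain = transfer_function(A, B, C, D, 0.0)   # H(0) = 1/2
```

Using `solve` instead of forming the inverse explicitly is the usual numerically preferable choice; for a MIMO system the same call returns the full transfer matrix at s.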
In our computing environment, discrete-time quantities are used; in that case, the continuous
time variable t is replaced with the discrete time variable k. Depending on the system type,
the state-space representation can take the different forms shown in Table 3.1.
In this regard, our thread execution pattern model is composed of two interconnected finite-
dimensional discrete linear time-varying deterministic systems; hence, our model is repre-
sented by the corresponding state-space model in Table 3.1.
Table 3.1: System Type vs State-Space Model

System Type                                   State-Space Model
Continuous time-invariant                     ẋ(t) = Ax(t) + Bu(t)
                                              y(t) = Cx(t) + Du(t)
Continuous time-variant                       ẋ(t) = A(t)x(t) + B(t)u(t)
                                              y(t) = C(t)x(t) + D(t)u(t)
Discrete time-invariant                       x(k + 1) = Ax(k) + Bu(k)
                                              y(k) = Cx(k) + Du(k)
Discrete time-variant                         x(k + 1) = A(k)x(k) + B(k)u(k)
                                              y(k) = C(k)x(k) + D(k)u(k)
Laplace domain of continuous time-invariant   sX(s) = AX(s) + BU(s)
                                              Y(s) = CX(s) + DU(s)
Z-domain of discrete time-invariant           zX(z) = AX(z) + BU(z)
                                              Y(z) = CX(z) + DU(z)
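The discrete time-invariant row of Table 3.1 can be iterated directly, which is how such a model is simulated in a computing environment. A minimal sketch with a hypothetical scalar system (the matrices below are illustrative, not the thesis model):

```python
import numpy as np

def step_discrete_lti(A, B, C, D, x0, u_seq):
    """Iterate x(k+1) = A x(k) + B u(k), y(k) = C x(k) + D u(k)."""
    x, outputs = x0, []
    for u in u_seq:
        outputs.append(C @ x + D @ u)   # output at step k
        x = A @ x + B @ u               # state advance to step k+1
    return x, outputs

# Hypothetical scalar system: x(k+1) = 0.5 x(k) + u(k), y(k) = x(k),
# driven by a constant unit input for four steps.
A = np.array([[0.5]]); B = np.array([[1.0]])
C = np.array([[1.0]]); D = np.array([[0.0]])
x_final, ys = step_discrete_lti(A, B, C, D, np.array([[0.0]]),
                                [np.array([[1.0]])] * 4)
```

The same loop works unchanged for the time-variant case by indexing A(k), B(k), C(k), D(k) per step.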
3.1.2 Adaptive Control Theory
As briefly explained in the literature review chapter, adaptive control addresses the inefficiency
of conventional feedback systems in controlling time-varying or unknown systems. More
specifically, systems whose plant is unknown, or whose plant parameters are subject
to unpredictable slow variations, fall within the adaptive control application domain.
Adaptive control provides an adjustment mechanism to iteratively re-design the controller to meet the
control objectives and specifications. As a result, continuous adaptation of the controller to the
varying system characteristics is achieved.
As previously discussed in the literature review (Chapter 2), adaptive self-tuning control schemes
can be categorized into three subcategories, identifier-based adaptive schemes, non-identifier
based schemes and dual adaptive control schemes, based on their online estimation and parameter
uncertainty compensation capabilities. Here, only the identifier-based parametric adaptive control
scheme, which is also referred to as the adaptive self-tuning control framework throughout the thesis,
is discussed. Please refer to Section 2.5.1 of the literature review for more details on the other schemes.
Identifier-Based Parametric Adaptive Control Scheme
This class of adaptive control schemes combines an online estimator with a conventional control
law. In this class of schemes, the system plant is identified using an online estimator at each control
loop, and based on the estimated plant model, the controller is designed using conventional
controller design methods.
A parametric adaptive control scheme is based on deterministic estimation algorithms rather
than on the stochastic ones of its nonparametric counterparts. In this scheme, the plant is repre-
sented by parameters, which are estimated using measured input/output data. This class of
scheme can be approached in two ways:
Indirect (Explicit) Adaptive Control In this approach, the plant parameters are estimated first,
and from these the controller parameters are derived. That is, at each
time instant t, an estimated plant is first formed, and this plant is used to derive the
controller parameters using conventional controller design methods. Figure
3.1 shows the structure of indirect adaptive control. In this figure, for a plant model
G(θ∗) where θ∗ is unknown, the online estimator generates an estimate θ(t) of the unknown θ∗.
The control system design module then generates the controller parameters θc(t)
for the corresponding estimated plant parameters θ(t), and the controller parameters are fed
into the controller C(θc). This whole process is repeated at each control cycle. In this
case, the main challenge in indirect adaptive control is to choose the class of control laws
C(θc), and the class of parameter estimators generating θc(t), so that the controller C(θc)
satisfies the performance requirements for the plant model G(θ∗) [Ioannou and Fidan,
2006]. Astrom and Wittenmark refer to this scheme as the indirect self-tuning regulator.
Figure 3.1: Indirect(Explicit) Adaptive Control [Ioannou and Fidan, 2006]
Direct (Implicit) Adaptive Control In the second approach, the plant model is parametrized
in terms of the desired controller parameters; then, without calculating plant pa-
rameter estimates, the desired controller parameters are estimated directly. Figure 3.2
demonstrates the structure of direct adaptive control. In direct adaptive control, the
plant model G(θ∗) is parametrized in terms of the unknown controller parameter θ∗c for which
C(θ∗c) satisfies the performance objectives [Ioannou and Fidan, 2006]. In this case, the
online estimator generates estimates θc(t) of θ∗c using the input and output of the
plant, and the estimated θc(t) is fed into the controller. Similar to the indirect scheme,
the main challenge in the direct adaptive control approach is the choice of control laws
C(θ∗c) and parameter estimators which ensure the plant meets the performance targets
or requirements. Moreover, this control scheme is applicable if and only if the plant can
be parametrized in terms of the unknown controller parameters.
Figure 3.2: Direct(Implicit) Adaptive Control [Ioannou and Fidan, 2006]
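The indirect scheme described above can be sketched as a minimal self-tuning loop: recursive least squares estimates the plant parameters at each cycle, and a pole-placement law recomputes the controller gain from those estimates. The scalar plant, its parameter values, the desired pole, and the RLS initialization below are all illustrative assumptions, not the thesis cache model.

```python
import numpy as np

# Hypothetical scalar plant y[k+1] = a*y[k] + b*u[k]; (a, b) unknown to the
# controller. Illustrative values; open loop is unstable since |a| > 1.
a_true, b_true = 1.2, 0.7
p_desired = 0.5                # desired closed-loop pole

theta = np.array([0.0, 1.0])   # initial estimate of (a, b)
P = 100.0 * np.eye(2)          # RLS covariance (large = low prior confidence)
y = 1.0
for k in range(50):
    # 1) Control design from the current estimates (pole placement):
    #    choose the feedback gain so that a_hat + b_hat*gain = p_desired.
    a_hat, b_hat = theta
    u = (p_desired - a_hat) / b_hat * y
    y_next = a_true * y + b_true * u           # plant response
    # 2) Recursive least-squares update of the plant parameter estimates.
    phi = np.array([y, u])
    gain = P @ phi / (1.0 + phi @ P @ phi)
    theta = theta + gain * (y_next - phi @ theta)
    P = P - np.outer(gain, phi @ P)
    y = y_next
```

With noise-free data the estimates converge close to (a, b) within a few cycles, after which the closed-loop pole sits near the desired value and the output is regulated to zero, illustrating the estimate-then-design loop of Figure 3.1.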
In our research project, the identifier-based parametric adaptive control scheme is utilized,
since a deterministic estimation approach is adopted for our research. As a future research
path, it might be worthwhile to compare deterministic and stochastic estimation approaches.
3.2 Development of Thread Execution Pattern Model
As mentioned previously, particularly on multi-core chip-level multiprocessor platforms
where some level of cache is shared among cores, the execution slot allocation (scheduling) of
threads might lead to a performance bottleneck if the shared Level 2 (L2) cache resource is
excluded from the scheduling decision. In this respect, our research effort and aim are concentrated
on developing a cache aware adaptive closed loop scheduling framework which provides an
optimal and fair allocation of instruction count while taking the cache patterns of threads into
consideration. In line with our research aim, a dynamic thread cache access pattern model, which
takes the L2 cache miss count and the co-runner threads' miss counts as parameters, is developed,
and its impact on overall processing speed is investigated. The actual processing power of a
processor is measured in cycles per instruction (CPI). Ideally, for an infinite cache, it would be possible
to consider CPI as an independent, purely processor-related metric; in practice, however, it
is inevitable to have stalls and delays during the memory operations of instructions. Hence, Matick
et al. [Matick et al., 2001] define the system performance as
CPIAct = CPIIdeal + FCP, (3.17)
where CPIIdeal is the ideal CPI measured with an infinite cache system, and FCP refers to the
finite cache penalty due to memory stalls or queue delays. The finite cache penalty (FCP) is
the sum of all negative factors degrading the performance of a processor. These additional
factors diversify as the system complexity, namely the depth of the memory hierarchy, the structure of
the memory levels and the number of cores, increases. In this regard, for the sake of simplicity, it is
assumed that each core has only a single level of private Level 1 (L1) cache. Furthermore, since
our research scope is limited to chip-level multiprocessors and no static shared memory
partitioning scheme is deployed, cross-interrogation among multiple private caches is ignored
at this stage. In addition, as mentioned previously, it is assumed that the caches are by default
write-through caches, so the cast-out impact, which applies to store-in caches, is ignored. The
cache reloading time and trailing-edge effect are also neglected because their
effect is small, especially for large, high-performance systems with wide, high-speed
buses. To sum up, in our cache model only the request bus queue (RBQ) and data bus queue (DBQ)
delays, the memory access latency (delay) and the cache miss count are considered in modeling the
cache behavior of the multi-core CMP system. Based on the model of Matick et al., the cache
model can be formulated as a linear function of the DBQ and RBQ delays, the memory access delay
and the miss count as follows:
FCPL2 = F (MC,DBQ,RBQ,MAD), (3.18)
66 CHAPTER 3. SYSTEM MODEL ADAPTIVE CONTROL
where MC, MAD refer to miss count and memory access delay respectively. In this case,
CPIAct in the Equation (3.17) can be expressed in discrete time domain as follows:
CPIAct[k] = CPIIdeal(Cache→∞) + F (MC[k], DBQ[k], RBQ[k],MAD[k]), (3.19)
At this point, we can combine both DBQ and RBQ delays under the BQ variable.
CPIAct[k] = CPIIdeal(Cache→∞) + F (MC[k], BQ[k],MAD[k]). (3.20)
As discussed previously, our cache aware adaptive closed loop scheduling framework imposes
instruction fairness by allocating extra cycles to a thread that has executed fewer instruction
cycles than its dedicated-cache version would have, in order to ensure that the instruction fair
thread achieves a fair instruction count (IC) independent of the current cache allocation. In
this respect, the fairness condition can be defined as follows:
Definition 1 Fairness Condition. The fairness condition refers to the operating state in which
the thread execution pattern is guaranteed to be independent of the co-runner cache access
pattern. In other words, the thread achieves the same instruction count in the shared cache
environment as in the dedicated cache resource environment.
In this case, ∆CPUCyc[k] shall be the system input which ensures the fairness condition
in Definition 1. In this regard, the instruction count at the next CPU cycle (CPU quantum) is
formulated as follows:
IC[k + 1] = IC[k] + CPIAct[k] ∆CPUCyc[k]
          = IC[k] + (CPIIdeal + F(MC[k], BQ[k], MAD[k])) ∆CPUCyc[k]. (3.21)
Since F is a linear function of MC[k], BQ[k] and MAD[k], the Equation (3.21) can be written
as:
IC[k + 1] = IC[k] + (CPIIdeal + A[k]MC[k] + B[k]MAD[k] + C[k]BQ[k]) ∆CPUCyc[k], (3.22)
where A[k], B[k], C[k] are time varying parameter matrices referring to the cache miss penalty,
the memory access delay penalty and the buffer queue delay penalty respectively. In this regard,
3.2. DEVELOPMENT OF THREAD EXECUTION PATTERN MODEL 67
for the sake of simplicity A[k], B[k], and C[k] are assumed to be constant coefficient matrices;
in other words, cache penalties are considered constant over time. As a result, A[k], B[k]
and C[k] in the Equation (3.22) can be replaced by A, B and C:
IC[k + 1] = IC[k] + (CPIIdeal + AMC[k] +BMAD[k] + CBQ[k])∆CPUCyc[k]. (3.23)
Remark 1 In fact, the Equation (3.23) is a bilinear equation, provided that IC[k], MC[k],
MAD[k] and BQ[k] are dynamic parameter matrices defining the current state of the model
dynamics, which are updated as part of the system dynamics on each sampling period. In this
case, the system model equation can be considered in the bilinear form given in Elliott
[2009]:
X(k + 1) = AX(k) +BX(k)U(k). (3.24)
However, the bilinear system dynamic is computationally expensive and hard to solve;
hence, the following hypothesis is developed to divide the existing bilinear system dynamic
into multiple linear system dynamics.
Hypothesis 2 The bilinear system model in the Equation (3.23) can be decomposed into four
linear sub-system dynamics based on the heuristic assumption that IC[k], MC[k], MAD[k] and
BQ[k] refer to independent dynamics:
1.
IC[k + 1] = IC[k] + (CPIIdeal + AU1 +BU2 + CU3)∆CPUCyc[k], (3.25)
where U1, U2, U3 are the outputs of the linear sub-systems, namely the miss count dynamic
system, the memory access delay dynamic system and the buffer queue dynamic system
respectively; these variables U1, U2, U3 are considered as inputs to the main system given in
the Equation (3.25).
2.
MC[k + 1] = A[k]MC[k] + C[k]CMC[k], (3.26)
where A and C refer to the unknown state matrix and the unknown input matrix respectively,
and CMC[k] refers to the co-runner miss count average (1/p) Σ_{i=1}^{p} MC(Ci) as an
indication of the current cache state. In this regard, a parameter estimation algorithm is
used to identify these unknown matrices.
3.
MAD[k + 1] = MAD[k] + V [k], (3.27)
is actually a stochastic linear system representation which depends on the probabilistic
characteristic of the random noise component V[k].
4.
BQ[k + 1] = BQ[k] +W [k], (3.28)
is also a stochastic linear system representation which depends on the probabilistic
characteristic of the random noise component W[k].
Considering the last three linear sub-system dynamics separately, and estimating the sub-
system characteristics in the Equations (3.26), (3.27) and (3.28), it is possible to determine the
thread execution pattern model (3.25) in that control period. This plant or system model is then
integrated into the adaptive self-tuning control framework in the following section to obtain
time varying robust control over the whole scheduling process.
Figure 3.3 illustrates these four sub-system state-space models (3.25), (3.26), (3.27) and
(3.28), which form the open loop thread execution pattern. This model is developed
on the MATLAB Simulink platform. The model inputs, colored green in Figure 3.3, are the
additional CPU cycles allocated to the thread, ∆CPUCyc[k], and the co-runner miss count
average, CMC[k]; the system outputs, in light blue in Figure 3.3, are the cache
miss count MC[k] and the instruction count IC[k]. In our model, for the sake of simplicity, it is
assumed that V[k] and W[k] in the Memory Access Delay (MAD) model (3.27) and the Buffer Queue
Delay (BQD) model (3.28) respectively are band-limited white noise with 0.1 Watt noise power.
This thread execution pattern model is then used to construct the adaptive self-tuning control
framework.
Figure 3.3: Thread Execution Pattern Model
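The open-loop dynamics (3.25)–(3.28) can be illustrated with a small discrete-time simulation. The sketch below is a scalar stand-in for the Simulink model, not the actual implementation: the constants `cpi_ideal`, `A`, `B`, `C` and the miss-count coefficients `a`, `c` are illustrative assumptions, and the band-limited white noise of the MAD and BQD models is approximated by Gaussian increments.

```python
import random

def simulate_thread_pattern(steps, cpi_ideal=1.0, A=2.0, B=0.5, C=0.3, seed=0):
    """Scalar sketch of the open-loop model (3.25)-(3.28).

    IC: instruction count, MC: miss count, MAD: memory access delay,
    BQ: buffer queue delay.  All coefficient values are illustrative."""
    rng = random.Random(seed)
    a, c = 0.9, 0.1                     # assumed miss count dynamics A[k], C[k] of (3.26)
    ic, mc, mad, bq = 0.0, 10.0, 1.0, 1.0
    for _ in range(steps):
        d_cpu = 1000.0                  # allocated quantum, ΔCPUCyc[k]
        cmc = 8.0 + rng.uniform(-1, 1)  # co-runner miss count average CMC[k]
        # (3.25): IC driven by the outputs of the three sub-systems
        ic += (cpi_ideal + A * mc + B * mad + C * bq) * d_cpu
        mc = a * mc + c * cmc           # (3.26): miss count dynamic
        mad += rng.gauss(0.0, 0.1)      # (3.27): random walk driven by V[k]
        bq += rng.gauss(0.0, 0.1)       # (3.28): random walk driven by W[k]
    return ic, mc

ic, mc = simulate_thread_pattern(200)
```

With these assumed coefficients the miss count settles near its steady state c·CMC/(1 − a) ≈ 8, while the instruction count accumulates every quantum, mirroring the role of the green inputs and light-blue outputs of Figure 3.3.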
3.3 Control Framework Selection
Following the thread execution pattern model given in Figure 3.3, it is necessary to build an
adaptive self-tuning control system model. In this regard, our measurement metric is the
instruction count IC[k], and ICCache−Dedicated[k] is our reference IC value. The error between
the actual IC[k] and ICCache−Dedicated[k] is ∆IC[k], formulated as:
∆IC[k] = IC[k] − ICCache−Dedicated[k]. (3.29)
Then, the internal variables, the miss count MC[k] together with ∆IC[k], and the average
co-runner miss count CMC are fed into the adaptive self-tuning control framework as inputs.
The Pole Placement Algorithm module of the adaptive self-tuning control framework uses the
cache pattern polynomial coefficients and the reference instruction count to design the linear
controller at each control period. The designed controller produces the controller output
CPUCyc[k] for a given instruction count error of the thread at a particular instant. Figure 3.4
illustrates this scenario: in the figure, the Thread Execution Pattern Model box refers to the
thread execution pattern model of Figure 3.3 discussed in the previous section, and the QR
Recursive Least Square (RLS) Estimator refers to the embedded function block which uses the
input parameters MC[k] and CMC to attain a parameter estimate in the Miss Count State Space
Subsystem. Following the parameter estimation, the pole placement algorithm module uses the
estimated parameters and the reference instruction count signal to design the controller, which
determines the control input CPUCyc[k] for a given instruction count error ICError in that
cycle. A detailed discussion of the adaptive framework is conducted in the following section.
Figure 3.4: Closed-Loop CMP Control System Model
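To make the role of the pole placement module concrete, the following sketch shows the idea for a first-order scalar miss count model; the actual module operates on the estimated cache pattern polynomial coefficients, so the function name, the model order and the numeric values here are illustrative assumptions.

```python
def place_pole(a_hat, c_hat, pole=0.2):
    """For the estimated scalar model mc[k+1] = a*mc[k] + c*u[k], the state
    feedback u[k] = -K*mc[k] gives mc[k+1] = (a - c*K)*mc[k]; choosing
    K = (a - pole)/c places the closed-loop pole at the desired location."""
    return (a_hat - pole) / c_hat

# Closed-loop check: when the estimate matches the true model, the miss
# count decays at the rate of the placed pole.
a_true, c_true = 0.9, 0.1
K = place_pole(a_true, c_true, pole=0.2)
mc = 1.0
for _ in range(25):
    mc = a_true * mc + c_true * (-K * mc)   # closed loop: mc <- 0.2*mc
```

In the framework the estimates a_hat, c_hat come from the QR RLS estimator at each control period, so the controller gain is redesigned whenever the cache pattern estimate changes.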
3.4 Adaptive Control Framework Development
In our research project, our aim is to develop a general framework applicable to a wide range
of systems. Furthermore, as can be seen in the thread execution pattern model (3.23), there are
unknown system parameters already parametrized in the model itself. Using online estimation
methods, it is possible to retrieve these parameters. Hence, the Self-Tuning Indirect (Explicit)
Adaptive Control scheme, which best matches our requirements, is adopted.
In our thread execution pattern model, the main source of uncertainty is the cache
miss count dynamics of the system, since the remaining subsystems are assumed to be stochastic
systems with known distributions. In this case, deterministic online estimation algorithms are
applied only to the state-space miss count subsystem defined in the Equation (3.26).
In this regard, the state-space representation of the cache miss count dynamic can be written
as follows for a general system order n (nth order system):
MC[k + 1] = A[k]MC[k] + C[k]CMC[k],
y[k] = H MC[k]. (3.30)
Definition 3 The state vector MC[k], input vector CMC[k], and gain vector H in the Equation
(3.30) can be defined in vector form as follows:
MC[k] = [mc1[k]  mc2[k]  · · ·  mcn[k]]^T, (3.31)
CMC[k] = [cMC[k − d]  cMC[k − 1 − d]  · · ·  cMC[k − m + 1 − d]],
H = [1  0  · · ·  0],
where H, MC[k] ∈ R^n, CMC[k] ∈ R^m, and d refers to the time delay between the input and
the output. In this instance, the state variables mc1[k], · · · , mcn[k] are defined as n
time-shifted signals:
MC[k] = [mc[k]  mc[k − 1]  mc[k − 2]  · · ·  mc[k − n + 1]]. (3.32)
In this respect, using the state-space model given in the Equation (3.30) and the Definition 3, it
is possible to derive the input-output relation:
y[k] = mc[k] = A1[k] [mc[k − 1]  · · ·  mc[k − n]]^T + C1[k] [cMC[k − d]  · · ·  cMC[k − d − m + 1]]^T. (3.33)
Considering A1[k] and C1[k] as row vectors with coefficients a1, · · · , an and c0, · · · , cm−1,
the following discrete-time deterministic autoregressive moving average (DARMA) time series
model can be derived:
y[k] = mc[k] = Σ_{j=1}^{n} aj y[k − j] + Σ_{j=0}^{m−1} cj cMC[k − j − d]. (3.34)
Then, the DARMA time series model (3.34) can be expressed in static parametric model
(SPM) form:
z = y[k] = mc[k] = θ*^T ϕ[k],
where θ* and ϕ[k] are defined as
θ* = [a1, a2, · · · , an, c0, c1, · · · , cm−1]^T,
ϕ[k] = [mc[k − 1], · · · , mc[k − n], cMC[k − d], cMC[k − d − 1], · · · , cMC[k − d − m + 1]]^T. (3.35)
Here, the Miss Count State Space Model is expressed as a static parametric model (SPM)
composed of the known signal vector ϕ[k] and the unknown coefficient vector θ*. In this regard,
the QR Recursive Least Square (RLS) parameter estimation algorithm tries to estimate θ* for
the given input and output signals in ϕ[k]. The accuracy of this estimation depends not only on
the properties of the adaptive control law, but also on the properties of the plant input sequence.
It is therefore necessary to analyze the properties of the adaptive control law and the plant
input; the following lemma summarizes the properties of both that are necessary for a stable
adaptive control scheme.
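The SPM construction (3.35) can be sketched directly: build ϕ[k] from past outputs and delayed inputs, stack the regressors, and recover θ* by least squares. The orders n = m = 2, the delay d = 1 and the coefficient values below are illustrative assumptions, and batch least squares here stands in for the recursive estimator of the framework.

```python
import numpy as np

def regressor(mc, cmc, k, n=2, m=2, d=1):
    """phi[k] as in (3.35): n past outputs followed by m delayed inputs."""
    past_y = [mc[k - j] for j in range(1, n + 1)]
    past_u = [cmc[k - d - j] for j in range(m)]
    return np.array(past_y + past_u)

theta_star = np.array([0.5, -0.2, 0.3, 0.1])   # [a1, a2, c0, c1], assumed
rng = np.random.default_rng(0)
cmc = rng.standard_normal(60)                  # exciting co-runner input
mc = np.zeros(60)
for k in range(3, 60):
    mc[k] = theta_star @ regressor(mc, cmc, k) # DARMA recursion (3.34)

# Stack z[k] = theta*^T phi[k] over k and solve for theta in the LS sense
Phi = np.array([regressor(mc, cmc, k) for k in range(3, 60)])
theta_hat, *_ = np.linalg.lstsq(Phi, mc[3:60], rcond=None)
```

Because the data are noiseless and the random input is persistently exciting, the stacked system recovers θ* exactly (up to numerical precision), which is the property the recursive estimator exploits sample by sample.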
Definition 4 A wide class of adaptive control laws can be represented in the following form:
ε(k) = (z(k) − θ^T(k − 1)ϕ(k)) / m²(k), (3.36)
θ(k) = θ(k − 1) + Γ ε(k)ϕ(k), (3.37)
where θ(k) refers to the estimate of θ* at time t = kT, ϕ(k) is the regressor, z(k) is the output of
the SPM
z = θ*^T ϕ, (3.38)
m(k) ≥ 1 is the normalization signal, and Γ = Γ^T > 0 represents the adaptive gain
[Ioannou and Fidan, 2006].
For the adaptive self-tuning control framework which uses the adaptive control laws in the form
defined in Definition 4, the following lemma is useful in analyzing the stability property as well
as defining estimation accuracy of θ∗.
Lemma 1 Provided that ‖Γ‖ |ϕ(k)|²/m²(k) < 2, the adaptive control law Equations (3.36) and
(3.37) have the following properties [Ioannou and Fidan, 2006]:
i) θ(k) ∈ ℓ∞, where ℓ∞ is the space of discrete-time signals bounded in the infinity norm.
ii) ε(k), ε(k)m(k), |ε(k)ϕ(k)|, |θ(k) − θ(k − N)| ∈ ℓ2 ∩ ℓ∞, where ℓ2 and ℓ∞ refer to the
Euclidean and the infinity norm respectively. In this respect, the Euclidean norm of a
discrete-time signal x : Z+ → R is
‖x‖2 = (Σ_{i=0}^{∞} |x(i)|²)^{1/2}, (3.39)
whereas the infinity norm of x is
‖x‖∞ = sup_{i∈Z+} |x(i)|, (3.40)
where the sup (supremum), or least upper bound, of the set {|x(i)| : i ∈ Z+} refers to the
smallest value which is greater than or equal to every |x(i)|. If x(i) ∈ ℓ2 ∩ ℓ∞, then
x(i) ∈ ℓp for all p ∈ [1,∞); for such a sequence x(i), the ℓp norm is defined as:
‖x‖p = (Σ_{i=0}^{∞} |x(i)|^p)^{1/p}, ℓp = {x(i) ∈ R^n : ‖x‖p < ∞}. (3.41)
In this case, the intersection of ℓ2 and ℓ∞ indicates that the infinite discrete sequences
ε(k), ε(k)m(k), |ε(k)ϕ(k)|, |θ(k) − θ(k − N)| are bounded in all of these norms, which
guarantees the stability of these sequences on an infinite dimensional vector space.
iii) ε(k), ε(k)m(k), |ε(k)ϕ(k)|, |θ(k) − θ(k − N)| → 0 as k → ∞, where N is any finite integer
N ≥ 1. This ensures that all these discrete sequences approach zero as k goes to infinity;
in other words, the adaptive control law finally reaches a steady state.
iv) Provided that ϕ(k)/m(k) is persistently exciting, i.e. it satisfies
Σ_{i=0}^{l−1} ϕ(k + i)ϕ^T(k + i)/m²(k + i) ≥ a0 l I, (3.42)
for all k, some fixed integer l > 1 and constant a0 > 0, then the estimate θ(k) approaches
θ* (θ(k) → θ*) exponentially.
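A minimal numerical sketch of the adaptive law (3.36)–(3.37) illustrates the exponential convergence of property iv). The dimensions, gain and parameter values are illustrative assumptions; with m²(k) = 1 + |ϕ(k)|², the Lemma 1 condition ‖Γ‖|ϕ(k)|²/m²(k) < 2 holds by construction for Γ = 0.5 I.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([0.8, -0.3])       # unknown true parameters (assumed)
theta = np.zeros(2)                      # initial estimate theta(0)
Gamma = 0.5 * np.eye(2)                  # adaptive gain, Gamma = Gamma^T > 0

for k in range(2000):
    phi = rng.standard_normal(2)         # persistently exciting regressor
    z = theta_star @ phi                 # SPM output z = theta*^T phi, (3.38)
    m2 = 1.0 + phi @ phi                 # normalization signal m^2(k) >= 1
    eps = (z - theta @ phi) / m2         # normalized estimation error, (3.36)
    theta = theta + Gamma @ (eps * phi)  # gradient parameter update, (3.37)
```

Under the persistently exciting random regressor the estimate converges to θ* well within the simulated horizon, matching property iv); without excitation (e.g. a constant ϕ) only the error ε(k) is driven to zero, not the parameter error.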
3.5 Concluding Remarks
In summary, the theoretical formulations discussed in this chapter have expressed the miss
count dynamics of the thread execution pattern model in static parametric model (SPM) form
for the estimation of the unknown system parameters. Furthermore, Lemma 1 is provided to
analyze the stability and determine the estimation accuracy of a given adaptive law. All these
mathematical formulations are implemented as MATLAB function code in the adaptive
self-tuning control framework block given in Figure 3.4.
Chapter 4
Parameter Estimation in Adaptive Control
In this chapter, one of the main components of the adaptive self-tuning control framework, the
parameter estimation strategy, is investigated. In our thesis, least square linear estimation
techniques are considered the most appropriate strategy for our cache aware adaptive closed
loop scheduling framework. In particular, the adaptive weighted QR Recursive Least
Square (RLS) algorithm is researched, and on the basis of the theory, a proprietary algorithm
is designed. The chapter starts with a brief theoretical background on parameter estimation
and the least square algorithm family, and ends with the adaptive weighted QR recursive least
square algorithm.
4.1 Theoretical Background of Parameter Estimation
In the context of the adaptive self-tuning control scheme, which is called an identifier-based
control scheme in the control literature, parameter estimation is crucial in identifying system
parameters. In order to successfully apply parameter estimation to the adaptive self-tuning
control scheme, the proper construction of a model set of the system model is essential. In fact,
the model set M* is defined as M* = {M(θ) | θ ∈ DM}, and each model M(θ) in this set is
associated with a predictor ŷ(t|θ) with an associated prediction error PDF fe(x, θ). The search
for the best model within the model set is the problem of finding a parameter vector θ such that
the prediction error is minimized.
The parameter estimation is actually a mapping from the input/output data Z^N to the parameter
vector θ which minimizes the prediction error:
Z^N → θ̂N ∈ DM, where
Z^N = [y(1), u(1), y(2), u(2), · · · , y(N), u(N)], (4.1)
and y and u are the output and input samples.
The evaluation of a candidate model is based on the prediction error of the model M(θ*):
ε(t, θ*) = y(t) − ŷ(t|θ*). (4.2)
For a given data set Z^N, the prediction error ε(t, θ*) can be derived for t = 1, 2, · · · , N; and
the "best model" is the model M(θ*) which has the smallest prediction error. To quantify the
prediction error sequence given in the Equation (4.2), there are various approaches, but the most
common is to form a scalar-valued norm or criterion function which measures the size of
ε(t, θ*).
In order to find the "best" parameter vector θ̂ corresponding to the "best model" M(θ*) in the
model set, the prediction error criterion function of each model in the model set M* is derived
from the prediction error sequence given in the Equation (4.2), and the θ̂ which minimizes the
prediction error criterion function is the parameter estimate of the system based on the input
and output data Z^N. The size of the prediction error sequence, or input/output data sequence,
is design specific; the larger the error sequence, the more precise the estimation will be.
More specifically, the minimization of the prediction error sequence with respect to the
parameter vector θ can be summarized in a number of steps:
1. In the first step, the prediction-error sequence of size N is filtered through a linear
filter L(q) to remove high-frequency disturbances and slowly varying terms that are not
critical to the modeling problem [Ljung, 1998]. This step can be called the pre-filtering
stage. The pre-filtering stage is critical and effective in the frequency-domain
interpretation of the criterion function (4.4), or objective function:
εF(t, θ) = L(q)ε(t, θ), 1 ≤ t ≤ N. (4.3)
4.1. THEORETICAL BACKGROUND OF PARAMETER ESTIMATION 77
2. The second step is the construction of the norm or criterion function
VN(θ, Z^N) = (1/N) Σ_{t=1}^{N} L(εF(t, θ)), (4.4)
where L is a scalar-valued function. In fact, L is a critical factor in determining the
criterion in our model evaluation or error optimization process. In this respect, the choice
of the L function is important; the most common choice is the quadratic norm L(ε) = ½ε²,
which is convenient both in computation and in analysis, but which has a few shortcomings
in terms of robustness. It is also possible to parametrize the quadratic norm with respect to
(w.r.t.) θ as L(ε, θ). In addition, a time-varying L norm is an option if the measurement
reliability, such as noise characteristics and measurement weighting, varies with time. In
such cases, a weighting function β(N, t) can be included in the criterion function given in
the Equation (4.4):
VN(θ, Z^N) = Σ_{t=1}^{N} β(N, t) L(ε(t, θ), θ). (4.5)
The weighting β(N, t) is useful in recursive identification, where estimates θ̂ for different
N are calculated.
3. The third step is the calculation of the parameter estimate vector θ̂ which minimizes
the criterion function given in the Equation (4.4):
θ̂ = θ̂(Z^N) = arg min_{θ∈DM} VN(θ, Z^N). (4.6)
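The three steps above can be sketched end to end on a toy scalar model. This is an illustrative stand-in, not the thesis implementation: the pre-filter is a simple mean-removal approximation of L(q) (it removes a constant disturbance), the norm is the quadratic choice L(ε) = ½ε², and the minimization of step 3 is a plain grid search.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(200)
theta_true = 1.7                       # assumed scalar parameter
y = theta_true * u + 0.5               # data with a constant (slowly varying) disturbance

def criterion(theta):
    eps = y - theta * u                # prediction error eps(t, theta), (4.2)
    eps_f = eps - eps.mean()           # step 1: pre-filter removes the slow term, (4.3)
    return 0.5 * np.mean(eps_f ** 2)   # step 2: quadratic criterion V_N, (4.4)

# step 3: theta_hat = argmin of V_N over a grid of candidates, (4.6)
grid = np.linspace(0.0, 3.0, 3001)
theta_hat = grid[np.argmin([criterion(t) for t in grid])]
```

Without the pre-filtering stage the constant disturbance would bias a naive fit; after filtering, the criterion is minimized at the true parameter, which is the point of choosing L(q) to suppress terms not critical to the model.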
As regards multivariable systems, the criterion function and the quadratic criterion are
rewritten to cope with multiple prediction error sequences. In multivariable, or MIMO,
systems the quadratic criterion is modified as:
L(ε) = ½ ε^T Λ⁻¹ ε, (4.7)
where Λ is a symmetric, positive semidefinite p × p matrix that encodes the relative importance
of the components of the p × 1 prediction error vector ε. Using the quadratic criterion the
criterion
function is constructed as follows:
VN(θ, Z^N) = h(QN(θ, Z^N)), where h(Q) = ½ tr(QΛ⁻¹),
QN(θ, Z^N) = (1/N) Σ_{t=1}^{N} ε(t, θ)ε^T(t, θ). (4.8)
After the derivation of the criterion function (4.8), the calculation of the parameter estimate θ̂
is analogous to the single-input case, and is solved using the same equation (4.6).
Depending on the choice of L(·), the pre-filter L(·), the model structure and the minimization
method, parameter estimation methods vary. In fact, the choices of these elements play a
significant role in selecting a particular estimation method from the family of parameter
estimation algorithms. In our thesis, we choose the least-squares parameter estimation family
due to its compatibility with recursive identification schemes and its minimization methods
with low computational cost.
4.2 Least Square Parameter Estimation
Least square estimation provides a solution to an inexactly specified system of equations;
in other words, to overdetermined systems. An overdetermined system refers to a system of
equations with more equations than unknowns; in general, such a system has no exact solution.
As a result, for a given system of equations, least square estimation provides an approximate
solution. To be more precise, least square estimation is an optimization problem with a known
cost/error function, and the solution to this problem is a set of parameter estimates which
minimizes this error/cost function. For instance, let us assume that g(ϕ) refers to an
overdetermined system for which a number of independent inputs ϕ, the so-called regressor,
and the output y are known, but for which it is impossible to find an exact solution or pattern.
In this case, the least square estimation algorithm provides a set of parameters θ which
linearly weights the given regressor inputs, and the combination of this set of parameters with
the regression vector gives an approximation of the system output. In this circumstance, the
error is the difference between the actual and approximate system output. Indeed, least square
estimation algorithms aim to find the best parameter vector, which minimizes this error.
As a reminder, please note that a regression model refers to a set of functions which
statistically or deterministically predicts the system behavior g(ϕ) in terms of a set of
parameters θ for given regression inputs ϕ and system outputs y. In fact, this system behavior
g(ϕ) predicts the system output ŷ(t|θ).
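A small numerical sketch of the overdetermined case: 100 equations in 3 unknowns with measurement noise admit no exact solution, and the least square estimate minimizes the residual instead. The data and parameter values are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((100, 3))      # regressor matrix, one row phi^T per equation
theta_true = np.array([2.0, -1.0, 0.5])
y = Phi @ theta_true + 0.01 * rng.standard_normal(100)  # noisy outputs

# theta_hat minimizes ||y - Phi theta||^2; no theta drives the residual to zero
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residual = y - Phi @ theta_hat
```

The nonzero residual is exactly the error discussed above (actual minus approximate output), while the estimate itself stays close to the generating parameters because the noise is small.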
4.3 Adaptive Weighted QR Recursive Least Square Algorithms
In traditional recursive least square algorithms, the stability depends on the autocorrelation
matrix R = ϕ^Tϕ, where ϕ is the input data vector, and on the inverse autocorrelation matrix P.
Although R is assumed non-singular, its inverse (R⁻¹ = P) can become ill-conditioned
depending on the persistence of excitation, or character, of the input signal.
Considering these possibilities, it is necessary to develop a better strategy to solve the
Recursive Least Square (RLS) problem. Hence, the QR decomposition approach, which is
numerically well-conditioned, is deployed. It actually prevents inaccurate solutions to the RLS
problem, and allows easy monitoring of the positive definiteness of a transformed input matrix.
4.3.1 Theoretical Background
To begin with a quick review, the Recursive Least Square (RLS) algorithm searches recursively
for the coefficients of the adaptive filter which minimize the cost function. The RLS cost
function is defined as follows:
ξd(k) = Σ_{i=0}^{k} λ^{k−i} ε²(i), (4.9)
ε(k) = mc(k) − ϕ^T(k)w(k), (4.10)
ϕ(k) = [mc(k)  mc(k − 1)  . . .  mc(k − N)  cmc(k)  cmc(k − 1)  . . .  cmc(k − N)]^T, (4.11)
w(k) = [w0(k)  w1(k)  · · ·  wN(k)],
where y, or mc, refers to the output of the system, the miss count, and cmc, the co-runner miss
count, together with the miss count mc forms the input of the system. Moreover, λ, ϕ(i), w and
ε refer to the adaptive weight (forgetting) coefficient, the input regression vector, the
parameter weight vector and the error vector respectively. Please note that although the θ
notation has previously been used for the parameter coefficient vector, from now on the w
notation will be used to prevent confusion with the Givens rotation matrices. To begin with,
for an N-dimensional system, the following input data matrix is constructed from the actual
output miss count (mc) and the co-runner miss count (cmc):
Ψ^T(k) = Ψ
= [ mc(k−1)      λ^{1/2}mc(k−2)      · · ·   λ^{(k−1)/2}mc(k−N)      λ^{k/2}mc(k−N−1)
    mc(k−2)      λ^{1/2}mc(k−3)      · · ·   λ^{(k−2)/2}mc(k−N−1)    0
    ⋮            ⋮                    ⋱       ⋮                       ⋮
    mc(k−N−1)    λ^{1/2}mc(k−N−2)    · · ·   0                       0
    cmc(k−1)     λ^{1/2}cmc(k−2)     · · ·   λ^{(k−1)/2}cmc(k−N)     λ^{k/2}cmc(k−N−1)
    cmc(k−2)     λ^{1/2}cmc(k−3)     · · ·   λ^{(k−2)/2}cmc(k−N−1)   0
    ⋮            ⋮                    ⋱       ⋮                       ⋮
    cmc(k−N−1)   λ^{1/2}cmc(k−N−2)   · · ·   0                       0 ]
= [ ϕ(k)   λ^{1/2}ϕ(k−1)   · · ·   λ^{k/2}ϕ(0) ]. (4.12)
Following the input matrix construction, the next step is the calculation of the posterior error
vector ε, which is the difference between the estimate m̂c(k) and the actual miss count vector
mc(k):
m̂c(k) = Ψ(k)w(k) = [m̂c(k)  λ^{1/2}m̂c(k − 1)  · · ·  λ^{k/2}m̂c(0)]^T, (4.13)
mc(k) = [mc(k)  λ^{1/2}mc(k − 1)  · · ·  λ^{k/2}mc(0)]^T, (4.14)
ε = [ε(k)  λ^{1/2}ε(k − 1)  · · ·  λ^{k/2}ε(0)]^T = mc(k) − m̂c(k). (4.15)
Following the posterior error vector derivation, the cost function (4.16) is derived from the
posterior error vector given in the Equation (4.15):
ξd(k) = ε^T ε. (4.16)
For such system and posterior error equations, the optimal RLS solution can be formalized as
given in the Equation (4.17):
Ψ^T(k)Ψ(k)w(k) = Ψ^T(k)mc(k), (4.17)
where the conventional RLS solution can become inaccurate because RD(k) = Ψ^T(k)Ψ(k) and
its inverse may be ill-conditioned when the persistence of excitation of the input is lost.
To avoid this, the QR decomposition approach is used.
4.3.2 Formulation and Theoretical Conclusions
The QR decomposition approach in a recursive least square (RLS) problem can in fact be
better explained in three stages: initialization, input data matrix triangularization, and QR-
Decomposition RLS Algorithm.
1st Stage: Initialization
During the initialization stage, for given input ϕ(k) and output mc(k) data from k = 0 to
k = N, it is possible to derive the initial parameter vector w from the RLS solution given in
the Equation (4.17). This is done using the so-called back-substitution algorithm:
wi(k) = (−Σ_{j=1}^{i} ϕ(j)w_{i−j}(k) + mc(i)) / ϕ(0). (4.18)
These parameter coefficients are considered as the initial condition for the RLS algorithm. The
reason the back-substitution method is preferred is that it does not use the matrix inversion
method while calculating parameter vector elements; in other words, it is a well-conditioned
algorithm. However, it is necessary for this method to have a data matrix in triangular form.
As a result, triangularization of input data matrix is conducted as a preliminary step for the
back-substitution algorithm.
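The operation behind (4.18) and, later, (4.26) is ordinary back-substitution on an upper-triangular system, which needs no matrix inversion. A generic sketch (the example matrix is illustrative):

```python
import numpy as np

def back_substitute(U, b):
    """Solve U w = b for upper-triangular U, working upward from the last
    equation; no matrix inversion is performed, only divisions by the
    diagonal entries."""
    n = len(b)
    w = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # subtract the already-solved unknowns, then divide by the diagonal
        w[i] = (b[i] - U[i, i + 1:] @ w[i + 1:]) / U[i, i]
    return w

U = np.array([[2.0, 1.0, 0.5],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 4.0]])
w_true = np.array([1.0, -2.0, 0.5])
w = back_substitute(U, U @ w_true)
```

The only requirement is the triangular form of the data matrix, which is exactly why the triangularization of the next stage is a prerequisite.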
2nd Stage: Input Data Matrix Triangularization
As mentioned previously, input data matrix triangularization is a preliminary step for
back-substitution, and this triangularization step has to be repeated recursively on each
control cycle. This iterative triangularization is inevitable as long as new input/output
information enters the system. As a result, at k = N + 1 the input data matrix has the
following form:
Ψ(N + 1) = [ mc(N + 1)  mc(N)  · · ·  mc(1)  cmc(N + 1)  cmc(N)  · · ·  cmc(1) ; λ^{1/2}Ψ(N) ]
         = [ ϕ^T(N + 1) ; λ^{1/2}Ψ(N) ]. (4.19)
As can be seen, the matrix Ψ(N + 1) in (4.19) is no longer in triangular form even though
Ψ(N) is. Hence, it is necessary to triangularize the matrix Ψ(N + 1) using an orthogonal
triangularization approach such as the Givens rotation. Alternative orthogonal triangularization
approaches are the Householder transformation and Gram-Schmidt orthogonalization. However, due
to the recursive nature of RLS algorithm, Givens rotation is more convenient in our research.
In the Givens rotation approach, the aim is to eliminate each element in the first row of the
input data matrix (4.19). This is achieved by multiplying Ψ(N + 1) with a series of Givens
rotation matrices. A Givens rotation matrix in this series has the form given in the Equation
(4.20):
Qθi(k) =
[ cos(θi(k))   0 · · · 0   −sin(θi(k))   0 · · · 0
  0                        0
  ⋮            Ii          ⋮             0
  0                        0
  sin(θi(k))   0 · · · 0    cos(θi(k))   0 · · · 0
  0                        0
  ⋮            0           ⋮             IN−i
  0                        0                       ], (4.20)
i.e., a plane rotation acting on the first row and the (i + 2)nd row, with the identity blocks
Ii and IN−i on the remaining diagonal.
In the Equation (4.20), the Givens rotation angle θi(k) is calculated such that the overall
orthogonalization goal, eliminating the first-row elements of the input data matrix (4.19), is
achieved. Each matrix Qθi(k) in the series of Givens rotation matrices Q(k) eliminates a single
element in the first row. Accordingly, θi is calculated using the following expressions:
cos(θi(k)) = Ui(k)_{i+1, 2N+1−i} / ci,
sin(θi(k)) = ϕi(k − 2N − i) / ci,
where
ci = √( Ui(k)²_{i+1, 2N+1−i} + ϕi(k − 2N − i)² ). (4.21)
Then, this set of Givens rotation matrices is multiplied with the input data matrix. As may
already be observed, the number of rotation matrices equals the number of elements in the
first row of the input data matrix. For an input data matrix Ψ(k) (4.12) which has 2N elements
in its first row, the set of Givens rotation matrices includes 2N elements, as shown in the
Equation (4.22):
Qθ(k) = Qθ_{2N}(k) Qθ_{2N−1}(k) · · · Qθ_{2N−i}(k) · · · Qθ_0(k). (4.22)
Recursively, at each time instant k, the rotation angles θ_{2N}, · · · , θ_0 are calculated and
the corresponding Qθi matrices are derived using the scheme defined in the Equation (4.20).
The final step is the multiplication of each of these Givens rotation matrices with the input
data matrix, and the derivation of the triangularized input matrix.
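The triangularization step of this stage can be sketched numerically. The sketch below simplifies the bookkeeping of (4.20)–(4.22): the new regressor row ϕᵀ is stacked on top of the weighted triangular factor λ^{1/2}U, and each first-row element is zeroed against the corresponding diagonal entry by a plane rotation; indices are 0-based and the matrix sizes are illustrative.

```python
import numpy as np

def rotate_in_row(U, phi, lam=0.99):
    """One stage-2 update: stack phi^T above sqrt(lam)*U, then zero the top
    row column by column with Givens rotations, restoring triangular form."""
    n = U.shape[0]
    M = np.vstack([phi, np.sqrt(lam) * U])   # (n + 1) x n
    for j in range(n):
        a, b = M[j + 1, j], M[0, j]          # diagonal entry vs. element to zero
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r                  # cos and sin of the rotation angle
        top, diag = M[0].copy(), M[j + 1].copy()
        M[j + 1] = c * diag + s * top        # rotated row keeps the diagonal
        M[0] = -s * diag + c * top           # top-row element becomes zero
    return M[1:]                             # updated triangular factor

rng = np.random.default_rng(4)
U0 = np.triu(rng.standard_normal((3, 3))) + 3.0 * np.eye(3)
phi = rng.standard_normal(3)
U1 = rotate_in_row(U0, phi, lam=0.9)
```

Because the rotations are orthogonal, the updated factor satisfies U1ᵀU1 = λ U0ᵀU0 + ϕϕᵀ, i.e., the exponentially weighted autocorrelation is updated without ever forming it explicitly.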
3rd Stage: QR-Decomposition RLS Algorithm
In this section, the triangularization procedure discussed so far is applied to the Recursive
Least Square (RLS) problem; the resulting algorithm is expected to be more stable and
well-conditioned since it avoids the calculation of the inverse autocorrelation matrix of
conventional RLS algorithms. The conventional recursive least square (RLS) algorithm considers
the posterior error as the difference between the actual measurement and the estimate; in our
research, these terms refer to the actual miss count and the estimated miss count respectively:
ε(k) = [ε(k)  λ^{1/2}ε(k − 1)  · · ·  λ^{k/2}ε(0)]^T = [mc(k)  λ^{1/2}mc(k − 1)  · · ·  λ^{k/2}mc(0)]^T − Ψ(k)w(k). (4.23)
In the Equation (4.23), the posterior weighted error vector is written as a function of the input
data matrix, the actual output data and the parameter coefficient vector. Following the QR
approach, both sides of the posterior error vector equation (4.23) are premultiplied by the
Givens rotation matrix (4.22):
Q(k)ε(k) = Q(k)mc(k) − Q(k)Ψ(k)w(k) = mcq(k) − [0 ; U(k)] w(k),
where
εq(k) = [εq_1(k)  εq_2(k)  · · ·  εq_{k+1}(k)]^T,
and
mcq(k) = [mcq_1(k)  mcq_2(k)  · · ·  mcq_{k+1}(k)]^T. (4.24)
Because the Givens rotation matrix Q(k) is orthogonal, the cost function, or weighted square
error ξd(k), of the QR-decomposed RLS problem and of the conventional RLS problem is
identical, as proved in the following equation:
ξd(k) = εq^T(k)εq(k) = ε^T(k)Q^T(k)Q(k)ε(k) = ε^T(k)Iε(k). (4.25)
The weighted square error in the Equation (4.25) is minimized using a back-substitution
algorithm. This is achieved by deriving w such that εq_{k−2N+1}(k) through εq_{k+1}(k) are
set to zero in the Equation (4.26) [Diniz, 2008]:
wi(k) = (−Σ_{j=1}^{i} u_{2N+1−i, i−j+1}(k)w_{i−j}(k) + mcq_{k+1−i}(k)) / u_{2N+1−i, i+1}(k), (4.26)
for i = 0, 1, · · · , 2N, where Σ_{j=1}^{0} ≡ 0.
More precisely, the rotated miss count vector can be written as a function of the weighted
posterior error vector, as given in the Equation (4.27):
mcq(k) = [mcq1(k) ; mcq2(k)],
with mcq1(k) = [mcq_1(k)  · · ·  mcq_{k−2N}(k)]^T and mcq2(k) = [mcq_{k−2N+1}(k)  · · ·  mcq_{k+1}(k)]^T, so that
mcq(k) = [εq_1(k)  · · ·  εq_{k−2N}(k)  0  · · ·  0]^T + [0 ; U(k)] w(k), (4.27)
where w(k) is the optimum coefficient vector at instant k. From the Equation (4.27), it can
easily be observed that mcq1(k) in fact does not depend on the parameter coefficient vector w(k):
mcq1(k) = [mcq_1(k)  · · ·  mcq_{k−2N}(k)]^T = [εq_1(k)  · · ·  εq_{k−2N}(k)]^T, (4.28)
whereas mcq2(k) is the matrix product of U(k) with the parameter vector w(k), as shown in the
Equation (4.29):
mcq2(k) = [mcq_{k−2N+1}(k)  · · ·  mcq_{k+1}(k)]^T = U(k)w(k). (4.29)
Indeed, the Equation (4.29) is the back-substitution equation (4.18) in matrix notation. In
addition, because Q is the product of a series of orthogonal Givens rotation matrices
Q_{2N} · · · Q_0, it is possible to conclude that Q can be rewritten in recursive form. According
to the Equation (4.23), the input data matrix can be triangularized using the Givens rotation
matrix Q:
Q(k)Ψ(k) = [0 ; U(k)]. (4.30)
Now, using the fact that Givens matrices are orthogonal (Q^TQ = I), the Equation (4.30) can be
written in the following form:
Q(k) [1  0^T ; 0  Q^T(k − 1)] [1  0^T ; 0  Q(k − 1)] [ϕ^T(k) ; λ^{1/2}Ψ(k − 1)] = [0 ; U(k)]. (4.31)
Defining
\[
\tilde{Q}(k) = Q(k)\begin{bmatrix} 1 & 0^T \\ 0 & Q^T(k-1) \end{bmatrix}, \tag{4.32}
\]
Equation (4.31) can be rewritten as:
\[
\tilde{Q}(k)\begin{bmatrix} \varphi^T(k) \\ \lambda^{1/2}\psi(k-1) \end{bmatrix}
= \begin{bmatrix} 0_{(k-N)\times(N+1)} \\ U(k) \end{bmatrix}. \tag{4.33}
\]
This is in fact equal to
\[
\tilde{Q}(k)\begin{bmatrix} \varphi^T(k) \\ 0_{(k-N-1)\times(N+1)} \\ \lambda^{1/2}\,U(k-1) \end{bmatrix}
= \begin{bmatrix} 0_{(k-N)\times(N+1)} \\ U(k) \end{bmatrix}. \tag{4.34}
\]
Using Equations (4.32) and (4.34), it is possible to deduce the structure of \(\tilde{Q}\). In fact, its structure includes an identity matrix \(I_{k-N-1}\), such that it can be represented in the following form [Apolinario and Miranda, 2009]:
\[
\tilde{Q}(k) =
\begin{bmatrix}
* & 0 & \cdots & 0 & * & \cdots & * \\
0 &   &        &   & 0 &        & 0 \\
\vdots & & I_{k-N-1} & & \vdots & & \vdots \\
0 &   &        &   & 0 &        & 0 \\
* & 0 & \cdots & 0 & * & \cdots & * \\
\vdots & & & & \vdots & \ddots & \vdots \\
* & 0 & \cdots & 0 & * & \cdots & *
\end{bmatrix}. \tag{4.35}
\]
According to the given structure of \(\tilde{Q}\), it is feasible to reduce the ever-increasing order of \(\tilde{Q}\), which is (k + 1) × (k + 1). This is achieved by removing the \(I_{k-N-1}\) section along with the corresponding rows and columns. Hence, Givens rotation matrices of size (N + 2) × (N + 2), as given in (4.22), are constructed. Using these matrices, Equation (4.36), which defines the miss count vector in terms of the actual miss count, the Givens rotation matrices and the triangularized miss count vector, is obtained. In this way, the other intermediate quantities are eliminated.
\[
mc(k) = \begin{bmatrix} \varepsilon_{q1}(k) \\ mc_{q2}(k) \end{bmatrix}
= Q_\theta(k) \begin{bmatrix} mc(k) \\ \lambda^{1/2}\, mc_{q2}(k-1) \end{bmatrix}. \tag{4.36}
\]
Equation (4.36) can also be rewritten in a form where the Givens rotation matrix is decomposed into a series of Givens rotation matrices as follows:
\[
mc(k) = Q_{\theta\,2N}(k)\, Q_{\theta\,2N-1}(k) \cdots Q_{\theta\,i}(k)
\begin{bmatrix} mc_i(k) \\ mc_{q2\,i}(k-1) \end{bmatrix}, \tag{4.37}
\]
where Qθi(k) can be derived using the Equation (4.22). In addition, it is worthwhile to state
that a posterior output error can be computed without explicit derivation of w [Diniz, 2008].
This provides significant computational cost savings.
The formulations and derivations of the QR-RLS algorithm given throughout this section can be summarized with a few remarks:
1. For a given input and output data, the initial parameter vector is constructed using the
back-substitution method given in the Equation (4.18).
2. The QR triangularization algorithm is applied to the posterior error vector equation given
in the Equation (4.23); a set of Givens rotation matrices is generated using the definition
(4.22). Each of these rotation matrices is applied iteratively to both sides of the equation
in order to obtain the triangularized equation given in the Equation (4.24).
3. From these well-conditioned triangularized vectors and matrices, the relation between
output and input data matrices is formulated in terms of the unknown parameter coeffi-
cient vector w(k) given in the Equation (4.29).
4. Based on this relation defined by the Equation (4.29), a back-substitution algorithm is
constructed to calculate unknown parameter vector elements.
Algorithm 1 Adaptive QR Recursive Least Square Estimation Algorithm

1. Initialize the parameter vector w, the input data matrix X(N), the upper triangular matrix \(U_0(N+1)\) and the system output vector \(mc_{q20}(N+1)\):
\[
x_{2N\times 1}(0) = \begin{bmatrix} mc(k) \\ mc(k-1) \\ \vdots \\ mc(k-N) \\ cmc(k) \\ cmc(k-1) \\ \vdots \\ cmc(k-N) \end{bmatrix} \tag{4.38}
\]
Do for i = 1 to 2N:
&nbsp;&nbsp;Do for j = 1 to N:
&nbsp;&nbsp;&nbsp;&nbsp;if \(N + 2 - i - j > 0\):
\[
X(i, j) = -\lambda^{(i-1)/2}\, mc(N + 2 - j - i) \tag{4.39}
\]
&nbsp;&nbsp;End
&nbsp;&nbsp;Do for j = N + 1 to 2N:
&nbsp;&nbsp;&nbsp;&nbsp;if \(2N + 2 - j - i > 0\):
\[
X(i, j) = \lambda^{(i-1)/2}\, cmc(2N + 2 - j - i) \tag{4.40}
\]
&nbsp;&nbsp;End
End
For k = 1 to N:
&nbsp;&nbsp;Do for i = 1 to k:
\[
w_i(k) = \frac{\sum_{j=1}^{i} x(j)\, w_{i-j}(k) + mc(i)}{x(0)} \tag{4.41}
\]
&nbsp;&nbsp;End
End
\[
U'_0(2N + 1) = \lambda^{1/2}\, X(2N + 1) \tag{4.42}
\]
\[
mc'_{q20}(2N + 1) = \big[\, \lambda^{1/2} mc(2N) \;\; \lambda\, mc(2N-1) \;\; \cdots \;\; \lambda^{(N+1)/2} mc(0) \,\big] \tag{4.43}
\]
2. For each time cycle k, \(2N + 1 \le k\), compute:
\[
\gamma^{-1} = 1, \tag{4.44}
\]
\[
mc'_0(k) = mc(k), \tag{4.45}
\]
\[
x'_0 = x^T(k). \tag{4.46}
\]
(a) Triangularization of the input matrix X and the output vector mc
&nbsp;&nbsp;i. Calculation of the Givens rotation matrix elements \(\cos\theta_i\) and \(\sin\theta_i\):
Do for i = 0 to 2N: (4.47)
\[
c_i = \sqrt{[U'_i(k)]^2_{i+1,\,2N+1-i} + x'^2_i(2N + 1 - i)}, \tag{4.48}
\]
\[
\cos(\theta_i) = \frac{[U'_i(k)]_{i+1,\,2N+1-i}}{c_i}, \tag{4.49}
\]
\[
\sin(\theta_i) = \frac{x'_i(2N + 1 - i)}{c_i}, \tag{4.50}
\]
\[
\begin{bmatrix} x'_{i+1}(k) \\ U'_{i+1}(k) \end{bmatrix}
= Q'_{\theta i}(k) \begin{bmatrix} x'_i \\ U'_i(k) \end{bmatrix}, \tag{4.51}
\]
\[
\gamma'_i = \gamma'_{i-1}\cos(\theta_i), \tag{4.52}
\]
\[
\begin{bmatrix} mc'_{i+1}(k) \\ mc'_{q2(i+1)}(k) \end{bmatrix}
= Q'_{\theta i}(k) \begin{bmatrix} mc'_i \\ mc'_{q2\,i}(k) \end{bmatrix}. \tag{4.53}
\]
End (4.54)
\[
mc'_{q20}(k + 1) = \lambda^{1/2}\, mc'_{q2\,2N+1}(k + 1), \tag{4.55}
\]
\[
U'_0(k + 1) = \lambda^{1/2}\, U'_{2N+1}(k), \tag{4.56}
\]
\[
\gamma(k) = \gamma'_{2N}, \tag{4.57}
\]
\[
\varepsilon(k) = d'_{2N+1}(k)\,\gamma(k). \tag{4.58}
\]
&nbsp;&nbsp;ii. The back-substitution algorithm is applied to find the parameter coefficients:
\[
mc = \big[\, mc'_{2N+1} \;\; mc'_{q2\,2N+1}(k) \,\big], \tag{4.59}
\]
\[
w_0(k) = \frac{mc_{2N+1}(k)}{U_{2N,1}(k)}, \tag{4.60}
\]
Do for i = 1 to 2N: (4.61)
\[
w_i(k) = \frac{-\sum_{j=1}^{i} U_{2N+1-i,\,i-j+1}\, w_{i-j}(k) + mc_{2N+2-i}(k)}{U_{2N+1-i,\,i+1}(k)}. \tag{4.62}
\]
End (4.63)
End (4.64)
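The rotation-and-back-substitution cycle of Algorithm 1 can be condensed into a short program. The following Python sketch is our own simplified rendering (flat lists, generic variable names, a single regressor row per step), not the thesis notation: a λ-weighted triangular factor is updated with explicit Givens rotations, and the coefficients are recovered by back-substitution.

```python
import math

def givens(a, b):
    # Rotation (c, s) that zeroes b against a: c*a + s*b = r, -s*a + c*b = 0.
    r = math.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def qr_rls_update(U, z, x, d, lam=0.98):
    """One QR-RLS time step: weight the triangular factor U and the rotated
    target z by sqrt(lambda), then annihilate the new regressor row x
    (with desired output d) into them via Givens rotations."""
    n = len(x)
    sq = math.sqrt(lam)
    U = [[sq * u for u in row] for row in U]
    z = [sq * zi for zi in z]
    x = list(x)
    for i in range(n):
        c, s = givens(U[i][i], x[i])
        for j in range(i, n):
            U[i][j], x[j] = c * U[i][j] + s * x[j], -s * U[i][j] + c * x[j]
        z[i], d = c * z[i] + s * d, -s * z[i] + c * d
    return U, z

def solve_coefficients(U, z):
    # Back-substitution on the triangular factor (cf. Equation (4.62)).
    n = len(z)
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = sum(U[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (z[i] - acc) / U[i][i]
    return w
```

With λ = 1 and two regressor samples of the model d = 2x₁ − x₂, two updates already recover w = (2, −1).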
4.3.3 Complexity Analysis of QR-RLS Algorithm
The QR-RLS algorithm improves the complexity of the conventional recursive least squares (RLS) algorithm from Θ(n³) to Θ(n²) [Apolinario and Miranda, 2009]. This complexity can be further improved to Θ(n) using Fast QR-RLS algorithms; however, due to the scope and time frame of our research, further theoretical analysis and the Fast QR-RLS algorithm are recommended as future work.
4.4 Concluding Remarks
In this chapter, a suitable parameter estimation algorithm is integrated into our adaptive self-tuning control framework. The Recursive Least Square (RLS) estimation algorithm family is chosen for its simplicity and its low computational cost as a recursive process repeated every control period. Among this family, the adaptive weighted QR Recursive Least Square (QR-RLS) algorithm is selected, because the conventional RLS algorithm suffers from an ill-conditioned estimation process. On the basis of these theoretical facts, an algorithm is developed for further simulation and experiment purposes. The performance of our algorithm is validated in the Experimental Setup and Simulation chapter.
Chapter 5
Algebraic Controller Design Methods
In this chapter, the controller design module of our adaptive self-tuning control framework is developed. The controller design problem, which is in fact a recursive design problem under given time-varying plant constraints, is addressed using algebraic controller design methods. The aim of this chapter is to formulate the controller design problem in terms of varying system constraints.
5.1 Introduction and Background
Algebraic controller design methods address a particular category of modern control design problems in which the controller has a pre-specified structure. In this case, it is only required to determine the controller parameters for which certain closed-loop performance requirements are met. These algebraic design methods solve many control design problems, such as input-output decoupling and exact model matching. In the context of adaptive control, algebraic methods addressing pole placement problems are called adaptive pole placement methods [Astrom and Wittenmark, 1994]. Ioannou and Fidan categorize adaptive pole placement methods into two groups:
Direct Adaptive Pole Placement In the first approach, so-called direct adaptive pole placement (APPC), the system dynamics can be parametrized in terms of the controller parameters. In this case, the parameter estimation algorithms explicitly produce controller parameters rather than system parameters. That is, the parameters of the control law are generated directly by solving algebraic equations, without any intermediate step involving estimates of the system parameters. However, direct APPC approaches are only applicable to a special class of systems whose dynamics can be expressed in terms of the controller parameters. This type of approach is deployed in the Direct (Implicit) Adaptive Control Scheme 3.2, the Model Reference Adaptive Control Scheme, and Dual Adaptive Control Systems.
Indirect Adaptive Pole Placement In the second approach, indirect adaptive pole placement (APPC), the recursive parameter estimation algorithms identify system or plant parameters rather than directly estimating controller parameters. These plant parameters are fed into control design algorithms, such as pole placement, to find the corresponding controller parameters. This approach is common in the Indirect Adaptive Control Scheme 3.1. Indirect APPC schemes are easy to design and applicable to a wide class of systems; however, uncertainty in the parameter estimation might lead to instability in the controller. This issue can be alleviated by using advanced control methodologies such as linear quadratic controller design or robust control system design.
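As a toy illustration of the indirect route, the sketch below (a hypothetical first-order plant and helper names of our own, not the thesis's cache model) first identifies the plant parameters from measured input-output data and only then derives the control law, here a one-step deadbeat design, from the estimates:

```python
def estimate_plant(y0, u0, y1, u1, y2):
    """Identify (a, b) of the plant y(k+1) = a*y(k) + b*u(k) from two
    consecutive transitions -- the 'indirect' step: plant parameters
    first, controller parameters afterwards."""
    det = y0 * u1 - y1 * u0
    a = (y1 * u1 - y2 * u0) / det
    b = (y0 * y2 - y1 * y1) / det
    return a, b

def deadbeat_control(a, b, y, r):
    # Design step: choose u so that a*y + b*u equals the reference r
    # in a single step (all closed-loop poles at the origin).
    return (r - a * y) / b

# Simulated plant with true parameters a = 0.5, b = 2.0 (illustrative values).
y0, u0 = 1.0, 1.0
y1 = 0.5 * y0 + 2.0 * u0
u1 = 0.0
y2 = 0.5 * y1 + 2.0 * u1

a_hat, b_hat = estimate_plant(y0, u0, y1, u1, y2)
u = deadbeat_control(a_hat, b_hat, y2, r=3.0)
y3 = 0.5 * y2 + 2.0 * u  # the plant output tracks r in one step
```

The separation between `estimate_plant` and `deadbeat_control` is exactly the structure the indirect APPC scheme prescribes: any design algorithm could be substituted for the deadbeat step without touching the estimator.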
In our adaptive self-tuning control framework, the indirect adaptive pole placement approach is used with a simple control design methodology, which creates a linear controller at each control period for a given set of plant parameters and a reference signal or command. Figure 5.1 illustrates a very basic adaptive self-tuning control framework with a general linear controller with two degrees of freedom. In this case, for a given single-input single-output system, the plant can be defined by the following equation:
\[
A(q)\,y(t) = B(q)\,\big(u(t) + v(t)\big), \tag{5.1}
\]
where y(t) is the output, u(t) is the input and v(t) is a disturbance. A and B are polynomials in the forward shift operator q, with degrees \(\deg A = n\) and \(\deg B = \deg A - d_0\), where \(d_0\) is the pole excess. In addition, a general controller can be described by
\[
R\,u(t) = T\,u_c(t) - S\,y(t), \tag{5.2}
\]
where R, S and T are polynomials. The control law (5.2), \(u(t) = \frac{T}{R}u_c(t) - \frac{S}{R}y(t)\), comprises negative feedback of y(t) with transfer function S/R and a feedforward of \(u_c(t)\) with transfer function T/R. Since the controller has two independent components, feedback and feedforward, which together determine the control input u(t), the controller has two degrees of freedom. Figure 5.1 shows the closed-loop system corresponding to this plant and controller. The closed-loop characteristic
Figure 5.1: A General Linear Controller with Two Degrees of Freedom [Astrom andWittenmark, 1994]
polynomial can be derived for the closed-loop system shown in Figure 5.1 after a few steps of derivation, eliminating u(t):
\[
y(t) = \frac{BT}{AR + BS}\,u_c(t) + \frac{BR}{AR + BS}\,v(t), \tag{5.3}
\]
\[
u(t) = \frac{AT}{AR + BS}\,u_c(t) - \frac{BS}{AR + BS}\,v(t), \tag{5.4}
\]
\[
AR + BS = A_c. \tag{5.5}
\]
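The characteristic polynomial in Equation (5.5) is ordinary polynomial arithmetic, so it can be formed numerically by convolution and addition; a minimal sketch (coefficient lists in ascending powers, helper names ours):

```python
def polymul(p, q):
    # Convolution: coefficients of the product of two polynomials.
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def polyadd(p, q):
    # Coefficient-wise sum with zero padding to the longer polynomial.
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return [pi + qi for pi, qi in zip(p, q)]

def closed_loop_characteristic(A, B, R, S):
    # Ac = A*R + B*S, Equation (5.5).
    return polyadd(polymul(A, R), polymul(B, S))
```

For example, A = [1, −0.5], R = [1], B = [1], S = [0.3] gives Ac = [1.3, −0.5].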
In this regard, the key objective is to specify the desired closed-loop characteristic polynomial Ac, and to solve for R and S given the system polynomials A and B. At this stage, the algebraic controller design algorithms restate this system-level controller problem in terms of polynomial coefficients. Then, to solve for the controller coefficients in terms of the system polynomial coefficients, these design algorithms use well-established mathematical tools such as Diophantine equations and the uncertain coefficient method. As a result, a set of controller parameters is calculated for a given time-varying system and a pre-determined reference closed loop. In fact, the algebraic design approach allows this strategy to be applied in a recursive manner.
Depending on the desired closed-loop system response characteristic, algebraic controller design algorithms can vary. In this thesis, two algebraic controller design algorithms are investigated: the deadbeat controller design algorithm and the adaptive pole placement algorithm. While the deadbeat controller design algorithm imposes a rigid and often impractical closed-loop system response, the pole placement controller design algorithm gives designers the flexibility to determine the closed-loop system response. In the following section, these two algebraic control design algorithms are applied to our adaptive self-tuning control framework.
5.2 Theoretical Formulation
As discussed previously, our adaptive self-tuning control framework involves the estimation of plant parameters and the design of a controller. Together, these two tasks guarantee the tracking of reference signals in highly unpredictable environments. In this chapter, the design of the controller is covered using two algebraic controller design methods: the deadbeat controller and the pole placement algorithm.
5.2.1 Deadbeat Controller Design
The deadbeat controller design algorithm is an endeavor to find a controller that achieves zero tracking error in a finite number of control steps. According to control theory, this is achieved by selecting the controller parameters so that the closed-loop system poles are located at the origin of the z-plane. A pole at the origin corresponds to a pulse in the time domain; in other words, minimum settling time and infinite controller gain. However, this design method is only readily applicable to linear closed-loop systems, and it remains an open research question for non-linear systems.
There are two versions of the deadbeat design method: a strong version and a weak version. In the strong version, the closed loop reaches the steady state after a finite number of control steps; in the weak version, the closed-loop system only converges to the steady state within a finite number of control steps. In our project, the strong version of the deadbeat controller design algorithm is covered. In this section, using the strong version of the deadbeat algorithm, an algebraic formulation of the controller design problem is derived for given plant parameters and a reference signal.
For the dynamic thread cache access pattern model given in Equation (3.30) and the parameter coefficient vectors derived in Equation (4.26) or (4.17), it is possible to find a controller such that the closed-loop system response achieves zero tracking error in a finite number of steps.
According to the dynamic cache access pattern model in Equation (3.30) and the adaptive closed-loop system model given in Equation (3.23), it is possible to derive a closed-loop transfer function and to formulate the closed-loop system response in terms of the unknown controller and known plant characteristic polynomials. More specifically, the basic steps of the deadbeat controller design algorithm are summarized in Algorithm 2.
So far, a brief overview of the procedure has been given through Algorithm 2; however, some points of the algorithm deserve more detailed discussion.
1. First of all, it is necessary to further discuss the polynomial orders given in Equation (5.14). To begin with, it is worthwhile to mention that in the course of developing the adaptive self-tuning control framework, a set of design decisions has been made on the structure and complexity of the plant and reference signals; for instance, the degrees of the plant transfer function polynomials \(A(z^{-1})\) and \(B(z^{-1})\), and the structure and order of the reference signal transfer function \(IC_{Cache-Dedicated} = \frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}\). Namely, the polynomials \(A(z^{-1})\) and \(B(z^{-1})\) are constructed according to the parameter estimation equation (4.62), which is in fact a representation of the closed-loop system plant. In our case, we consider the plant dynamics as a second order transfer function composed of the polynomials \(A(z^{-1})\) and \(B(z^{-1})\). For estimated plant coefficients \(a_i\) and \(b_i\), where i = 0, ..., 2, these polynomials can be written as:
\[
A(z^{-1}) = a_0 + a_1 z^{-1} + a_2 z^{-2},
\qquad
B(z^{-1}) = b_0 + b_1 z^{-1} + b_2 z^{-2}. \tag{5.15}
\]
At this stage, other important design decisions can be considered, such as the construction of the reference signal structure. According to the polynomial degree criterion given in Equation (5.14), a second order reference signal \(IC_{Cache-Dedicated} = \frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}\) is constructed:
\[
N_{IC}(z^{-1}) = ic_{Cache-Dedicated},
\qquad
D_{IC}(z^{-1}) = 1 + 0.8 z^{-1} + 0.25 z^{-2}. \tag{5.16}
\]
Finally, based on all the design decisions made and the criterion equation (5.14), it is possible to reach a conclusion on the orders of the controller polynomials. As a result, the controller transfer function polynomial orders are
Algorithm 2 Deadbeat Controller Design Algorithm

1. For the given state-space models (3.23) and (3.30), the overall closed-loop transfer function in terms of the estimated plant parameters and the unknown controller parameters is obtained:
\[
\frac{IC(k)}{IC_{Cache-Dedicated}(k)}
= \frac{Q(z^{-1})\,\big(K A(z^{-1}) + CP\, B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\,\big(K A(z^{-1}) + CP\, B(z^{-1})\big)}, \tag{5.6}
\]
where \(A(z^{-1})\), \(B(z^{-1})\) refer to the cache access pattern (miss count) dynamic transfer function, \(P(z^{-1})\), \(Q(z^{-1})\) indicate the controller transfer function, CP is the cache miss penalty, and \(K = CPI_{Ideal} + Delay(k)\) is the sum of the ideal cycles per instruction for the thread and the memory access delay:
\[
\frac{D(z^{-1})}{C(z^{-1})} = CPI_{Ideal} + Delay(k) + CP\,\frac{B(z^{-1})}{A(z^{-1})}\,C_{mc}(k). \tag{5.7}
\]

2. From the closed-loop transfer function in Equation (5.6), a number of solvable algebraic polynomial equations, the so-called Diophantine equations, is derived:
\[
A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\,\big(K A(z^{-1}) + CP\, B(z^{-1})\big) = 1, \tag{5.8}
\]
where the dedicated cache reference signal \(IC_{Cache-Dedicated}\) is considered as the transfer function
\[
\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}. \tag{5.9}
\]
After deriving the first Diophantine equation (5.8), the tracking error of the closed-loop system can be formulated as:
\[
E(z^{-1}) = \Big[1 - \frac{Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}\Big]\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}, \tag{5.10}
\]
\[
= \Big[\big(1 - Q(z^{-1})\big)\big(K A(z^{-1}) + CP\,B(z^{-1})\big)\Big]\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}. \tag{5.11}
\]
Under the assumption that the division of \(1 - Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)\) by the polynomial \(D_{IC}\) also results in a polynomial \(S(z^{-1})\):
\[
S(z^{-1}) = \frac{1 - Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{D_{IC}(z^{-1})}, \tag{5.12}
\]
which leads to the last Diophantine equation:
\[
S(z^{-1})D_{IC}(z^{-1}) + Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = 1. \tag{5.13}
\]
3. Based on the degrees of the plant transfer function polynomials, which may be considered a design decision made at the parameter estimation stage, it is possible to determine the polynomial degrees of the controller transfer function as follows:
\[
\partial S(z^{-1}) = \max\big(\partial A(z^{-1}), \partial B(z^{-1})\big) - 1,\quad
\partial Q(z^{-1}) = \partial D_{IC}(z^{-1}) - 1,\quad
\partial Q(z^{-1}) = \partial A(z^{-1}) - 1,\quad
\partial P(z^{-1}) = \max\big(\partial B(z^{-1}), \partial A(z^{-1})\big). \tag{5.14}
\]

4. Using the facts given in Equation (5.14), rewrite the Diophantine equations (5.13) and (5.8) in polynomial notation.

5. Apply the uncertain coefficient method to derive the polynomial coefficients of the controller transfer function.
obtained as the 1st order polynomial \(Q(z^{-1})\) and the 2nd order polynomial \(P(z^{-1})\):
\[
Q(z^{-1}) = q_0 + q_1 z^{-1}, \tag{5.17}
\]
\[
P(z^{-1}) = p_0 + p_1 z^{-1} + p_2 z^{-2}, \tag{5.18}
\]
where the coefficients \(q_i\) and \(p_j\) are unknown. In fact, the algebraic controller design methods seek an expression for these coefficients in terms of the known plant and reference polynomials.
2. In this case, the Diophantine equation (5.8) is a good starting point for deriving an expression for the relationship between the plant and controller polynomials. The uncertain coefficient method is applied to the Diophantine equation (5.8) to express the controller polynomials in terms of the other, known plant and reference polynomial coefficients. Indeed, the uncertain coefficient method solves for an unknown polynomial by equating the coefficients of equal power terms. This approach leads to a system of linear equations whose solution corresponds to the controller polynomial coefficients. By rewriting the Diophantine equation (5.8) in terms of polynomial
coefficients, the following set of equations is derived:
\[
A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = 1,
\]
\[
(p_0 + p_1 z^{-1} + p_2 z^{-2})(a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1})\,z^{-1} K (a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1})\,z^{-1} CP (b_0 + b_1 z^{-1} + b_2 z^{-2}) = 1. \tag{5.19}
\]
It is more convenient for the uncertain coefficient method to express Equation (5.19) in expanded polynomial form; using the distributive property, the following expansion is obtained:
\[
p_0 a_0 + (p_0 a_1 + p_1 a_0)z^{-1} + (p_0 a_2 + p_1 a_1 + p_2 a_0)z^{-2} + (p_2 a_1 + p_1 a_2)z^{-3} + (p_2 a_2)z^{-4}
\]
\[
+\, K q_0 a_0 z^{-1} + (K q_0 a_1 + K q_1 a_0)z^{-2} + (K q_1 a_1 + K q_0 a_2)z^{-3} + (K q_1 a_2)z^{-4}
\]
\[
+\, CP q_0 b_0 z^{-1} + (CP q_0 b_1 + CP q_1 b_0)z^{-2} + (CP q_1 b_1 + CP q_0 b_2)z^{-3} + (CP q_1 b_2)z^{-4} = 1. \tag{5.20}
\]
The resulting linear system of equations can be expressed either in matrix form (5.21) or in algebraic form as a set of equations (5.22). Although the matrix form is only a compact representation of the algebraic one, it is convenient for computational purposes.
\[
\begin{bmatrix}
a_0 & 0 & 0 & 0 & 0 \\
a_1 & a_0 & 0 & (K a_0 + CP\,b_0) & 0 \\
a_2 & a_1 & a_0 & (K a_1 + CP\,b_1) & (K a_0 + CP\,b_0) \\
0 & a_2 & a_1 & (K a_2 + CP\,b_2) & (K a_1 + CP\,b_1) \\
0 & 0 & a_2 & 0 & (K a_2 + CP\,b_2)
\end{bmatrix}
\begin{bmatrix} p_0 \\ p_1 \\ p_2 \\ q_0 \\ q_1 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \tag{5.21}
\]
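Under our assumptions (a second order plant with scalar K and CP, as in the thesis; the helper names are ours), the matrix form reduces the deadbeat design to one small linear solve. A sketch:

```python
def solve_linear(M, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    A = [list(row) + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (A[i][n] - acc) / A[i][i]
    return x

def deadbeat_gains(a, b, K, CP):
    """Build the coefficient matrix of the Diophantine system and solve it;
    unknowns ordered [p0, p1, p2, q0, q1]."""
    g = [K * a[i] + CP * b[i] for i in range(3)]  # coefficients of K*A + CP*B
    M = [
        [a[0], 0.0,  0.0,  0.0,  0.0 ],
        [a[1], a[0], 0.0,  g[0], 0.0 ],
        [a[2], a[1], a[0], g[1], g[0]],
        [0.0,  a[2], a[1], g[2], g[1]],
        [0.0,  0.0,  a[2], 0.0,  g[2]],
    ]
    return solve_linear(M, [1.0, 0.0, 0.0, 0.0, 0.0])
```

This linear solve is the Θ(n³) Gaussian elimination step noted in the complexity analysis of the deadbeat algorithm.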
In algebraic form, the controller polynomial coefficients are derived by equating the coefficients of equal powers on both sides of Equation (5.20). As a result, the following set of equations is inferred:
\[
p_0 a_0 = 1,
\]
\[
p_0 a_1 + p_1 a_0 + K q_0 a_0 + CP\, q_0 b_0 = 0,
\]
\[
p_0 a_2 + p_1 a_1 + p_2 a_0 + K q_0 a_1 + K q_1 a_0 + CP\, q_0 b_1 + CP\, q_1 b_0 = 0,
\]
\[
p_2 a_1 + p_1 a_2 + K q_1 a_1 + K q_0 a_2 + CP\, q_1 b_1 + CP\, q_0 b_2 = 0,
\]
\[
p_2 a_2 + K q_1 a_2 + CP\, q_1 b_2 = 0. \tag{5.22}
\]
3. Although the matrix and algebraic forms of the uncertain coefficient method differ in approach, both give the same solution set for the controller transfer function coefficients. To provide a simpler presentation, a number of constants, given in Equation (5.23), are defined and used in the controller coefficient expressions:
\[
\begin{aligned}
H_1 &= (K a_2 + CP\, b_2) - (K a_0 + CP\, b_0)(a_2/a_0),\\
H_2 &= (K a_1 + CP\, b_1) - (K a_2 + CP\, b_2)(a_1/a_2),\\
H_3 &= (K a_1 + CP\, b_1) - (a_1/a_0)(K a_0 + CP\, b_0),\\
H_4 &= (K a_0 + CP\, b_0) - (a_0/a_2)(K a_2 + CP\, b_2),\\
H_5 &= -(1/a_0)\,a_2 - (1/a_0^2)\,a_1^2,\\
H_6 &= (1/a_0^2)\,a_2 a_1.
\end{aligned} \tag{5.23}
\]
\[
\begin{aligned}
q_0 &= (H_5 H_2 - H_4 H_6)/(H_3 H_2 - H_4 H_1),\\
q_1 &= (H_6 - H_1 q_0)/H_2,\\
p_0 &= 1/a_0,\\
p_1 &= -\big(a_1/a_0^2 - (K a_0 + CP\, b_0)(q_0/a_0)\big),\\
p_2 &= -\big((K a_2 + CP\, b_2)/a_2\big)\, q_1.
\end{aligned} \tag{5.24}
\]
To sum up, in this section we have derived the controller polynomial coefficients in terms of the estimated plant parameters and other known variables. This set of equations (5.24) eases the integration of numerical algebraic controller design into our cache aware adaptive closed loop scheduling framework. Although this method provides a fairly simple algebraic controller design strategy, its major drawback is the rigid closed-loop response with pre-set poles fixed at zero, which leads to very fast stabilization of the process output. To be more precise, fast stabilization requires control inputs with large peaks over a short period of time; in other words, the faster the stabilization, the higher the required actuator gain. In real applications this is not always feasible, due to saturation in physical components. Hence, the pole placement controller design algorithm is proposed as a widely applicable and flexible alternative.
Deadbeat Algorithm Complexity Analysis
For a fixed pole location at the origin, the deadbeat controller design algorithm is, in fact, the solution of a linear matrix equation Ax = b, given in Equation (5.21). Hence, the complexity of the deadbeat algorithm is Θ(n³) with the Gaussian elimination method, and up to Θ(nⁿ) with naive polynomial methods. Due to this algorithmic complexity, the system order is critical to the computational efficiency of the overall framework.
5.2.2 Pole Placement Controller Design
The pole placement controller design algorithm derives a set of controller characteristic polynomial coefficients for a pre-designed reference closed-loop system response and given time-varying plant polynomial coefficients. The main addition compared with deadbeat controller design is that the designer controls the closed-loop system response by specifying the locations of the poles of the closed-loop characteristic polynomial. In this way, the designer can ensure stability as well as the required characteristics of the closed-loop response, such as settling time, maximum overshoot and damping ratio. This does add complexity, since the designer must make design decisions on the closed-loop system behavior; on the other hand, it increases the flexibility and applicability of the approach to various control scenarios.
In our project, the following steps are carried out to apply the pole placement method to our controller design problem.
1. For the given state-space models (3.23) and (3.30), formulate the closed-loop transfer function:
\[
\frac{N_{G_{Des}}}{D_{G_{Des}}} = G_{Desired}(z^{-1})
= \frac{Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}, \tag{5.25}
\]
where \(A(z^{-1})\), \(B(z^{-1})\) refer to the cache access pattern (miss-count) dynamics transfer function, \(P(z^{-1})\), \(Q(z^{-1})\) are the controller transfer function polynomials, CP is the cache miss penalty, \(K = CPI_{Ideal} + Delay(k)\) is the sum of the ideal cycles per instruction for the thread and the memory access delay, and \(G_{Desired}(z^{-1})\) is the desired closed-loop transfer function response.
2. From the closed-loop transfer function (5.25), derive a number of solvable algebraic polynomial equations, the so-called Diophantine equations. Note that in the deadbeat controller case the poles were placed at z = 0, so that \(A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = 1\) in Equation (5.8). However, since the specific pole locations are now a design decision, the desired pole locations must also be considered in the formulation previously derived for the deadbeat controller design case. As a result, the Diophantine equation is formulated as:
\[
A(z^{-1})P(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = D_{G_{Desired}}(z^{-1}). \tag{5.26}
\]
3. Derive the tracking error function \(E(z^{-1})\):
\[
E(z^{-1}) = IC_{Cache-Dedicated}(z^{-1}) - IC(z^{-1}), \tag{5.27}
\]
where \(IC_{Cache-Dedicated}(z^{-1})\) refers to the reference cache-dedicated instruction count, in our case \(\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}\), and \(IC(z^{-1})\) refers to the actual instruction count:
\[
E(z^{-1}) = \Big[1 - \frac{Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}\Big]\frac{N_{IC}(z^{-1})}{D_{IC}(z^{-1})}. \tag{5.28}
\]
Considering that the Diophantine equation is
\[
P(z^{-1})A(z^{-1}) + z^{-1}Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = D_{G_{Des}}(z^{-1}), \tag{5.29}
\]
and under the assumption that \(D_{IC}(z^{-1})\) divides the numerator of the tracking error transfer function (5.28), denoting this ratio by \(S(z^{-1})\), the second Diophantine equation is deduced:
\[
S(z^{-1}) = \frac{D_{G_{Des}}(z^{-1}) - Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big)}{D_{IC}(z^{-1})},
\]
which leads to the Diophantine equation
\[
S(z^{-1})D_{IC}(z^{-1}) + Q(z^{-1})\big(K A(z^{-1}) + CP\,B(z^{-1})\big) = D_{G_{Des}}(z^{-1}). \tag{5.30}
\]
4. Considering the Diophantine equations (5.26) and (5.30) and the closed-loop transfer function (5.25), it is possible to reach a conclusion about the degrees of the polynomials. However, in contrast with the previous approach, the Diophantine equation (5.26) includes \(D_{G_{Des}}(z^{-1})\); this term adds complexity, particularly in determining the controller polynomial degrees \(\partial Q(z^{-1})\), \(\partial P(z^{-1})\) and the closed-loop transfer function degree. According to Bobal et al. [2005], the following criterion is considered:
\[
\partial D_{G_{Des}}(z^{-1}) \le \partial A(z^{-1}) + \partial \max\big(A(z^{-1}), B(z^{-1})\big) + 1 - 1. \tag{5.31}
\]
If this condition holds, then the same polynomial degree relations as in the deadbeat method (5.14) are also valid for the pole placement algorithm. If the condition is not met, \(\partial P(z^{-1})\) and \(\partial Q(z^{-1})\) cannot be uniquely determined. In our case, according to the QR-RLS parameter estimation algorithm, the polynomials of the system (thread execution pattern) plant are second order:
\[
\partial A(z^{-1}) = \partial B(z^{-1}) = 2. \tag{5.32}
\]
Hence, in order to meet the criterion given in Equation (5.31), \(\partial D_{G_{Des}}(z^{-1}) \le 4\) should hold. Under the assumption that the criterion is met, it is possible to calculate the polynomial degrees for the known plant polynomial degrees as:
\[
\partial A(z^{-1}) = 2, \tag{5.33}
\]
\[
\partial B(z^{-1}) = 2, \tag{5.34}
\]
\[
\partial P(z^{-1}) = 2, \tag{5.35}
\]
\[
\partial Q(z^{-1}) = 1. \tag{5.36}
\]
For these polynomials, the closed-loop polynomial degree is calculated using the closed-loop system polynomial (5.25):
\[
\partial D_{G_{Des}}(z^{-1}) = \max\big(\partial A(z^{-1}) + \partial P(z^{-1}),\; \partial A(z^{-1}) + 1 + \partial Q(z^{-1})\big). \tag{5.37}
\]
From Equation (5.37), it is clear that the closed-loop polynomial degree is four. Hence, it is now necessary to design the desired fourth order closed-loop response.
5. In the closed-loop pole assignment, the closed-loop poles are first designed in the continuous time domain, based on design decisions regarding the damping ratio, sampling period and oscillation frequency. In our project, the following design decisions on the closed-loop characteristics have been made:

Oscillation frequency \(\omega_n\sqrt{1-\varsigma^2}\): 275 kHz
Damping ratio \(\varsigma\): 0.54
Sampling period \(T_0\): 1.5e-06 sec
Maximum overshoot \(\Delta y\): 13.32 %

Using these design constraints, two complex conjugate poles and two real poles are derived:
\[
s_{1,2} = -9.33 \times 10^{5} \pm 1.45 \times 10^{6}\,i,\quad
s_3 = -1.0094 \times 10^{6},\quad
s_4 = -7.5962 \times 10^{5}. \tag{5.38}
\]
Here, please note that only the complex conjugate poles affect the damping and overshoot of the system; the real poles have no damping or oscillation impact on the closed-loop system response. Apart from the closed-loop response, the stability of the system should also be under the designer's consideration. As can be seen, all poles lie in the left half of the complex s-plane. As an internal validation, the impulse response of the designed closed-loop system is checked, and stability analysis is conducted on this system. Figure 5.2 indicates the continuous-time characteristic of the closed-loop system. In Figure 5.2, the step response of the designed closed-loop system is given; that is, how fast and how accurately the cache aware adaptive closed loop scheduling framework adapts to a change in the reference cache dedicated instruction count. As can be seen, the closed
Figure 5.2: Step Response of Closed Loop Transfer Function
loop system has a slow but accurate response to changes in the reference input. Here, accuracy refers to the overshoot percentage (13%), and speed of convergence refers to the settling time in Figure 5.2.
The stability of the closed-loop system is another important factor to be considered. In this regard, Figure 5.3 shows measures of robustness; in other words, how far the system is from instability. The first sub-figure from the left is a root locus plot, which indicates the system response against variation of the system gain. The other two sub-figures are Bode plots and refer to the robustness of stability. In the Bode plots, two terms are significant for our stability analysis. The first is GM, the gain margin, which is a measure of stability: GM is the additional controller gain, or any other system gain, that would lead to instability; in our design, stability is guaranteed up to a gain of 104 dB. The phase margin (PM) is associated with the robustness of the closed-loop stability; in other words, it measures the resistance of the closed-loop system against undesired oscillations. Hence, the larger the phase margin, the less the closed-loop system oscillates due to noise at the input or in the system. Our design has a phase margin of 114 degrees, which is well above the commonly recommended minimum of about 40 degrees. After validating the closed-loop performance and stability in the continuous domain for the designed poles, it is necessary to map the continuous poles to discrete-time poles using the exponential mapping.
Figure 5.3: Closed Loop Stability Analysis Bode Plots & Root Locus
Following the performance validation, it is necessary to map the continuous-time poles to the discrete-time domain. Although there are many different transformation techniques, such as the bilinear transformation, our project uses the simple exponential translation between the continuous and discrete time domains:

z_i = e^{s_i T_0},    (5.39)

where, as mentioned previously, T_0 refers to the sampling period and is equal to 1.5e-06 s in our system. As a result of this exponential mapping, the following discrete-time poles are derived:

z_{1,2} = -0.4 ± 0.3i,
z_3 = +0.22,
z_4 = +0.32.    (5.40)
6. Using the designed pole locations stated in Equation (5.40), the characteristic polynomial D_{GDes}(z^{-1}) of the closed loop system is calculated as:

D_{GDes}(z^{-1}) = 1 - 0.2596 z^{-1} - 0.0201 z^{-2} - 0.0131 z^{-3} + 0.0043 z^{-4},    (5.41)

where, for a general polynomial D_{GDes}(z^{-1}) = d_0 + d_1 z^{-1} + d_2 z^{-2} + d_3 z^{-3} + d_4 z^{-4}, the polynomial coefficients are:

d_0 = 1,
d_1 = -0.2596,
d_2 = -0.0201,
d_3 = -0.0131,
d_4 = 0.0043.    (5.42)
In the rest of the discussion, rather than using the numerical values of D_{GDes}(z^{-1}), the parameters given in Equation (5.42) are used to obtain a more general solution, applicable even to different closed loop system designs.
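The coefficients of a characteristic polynomial such as (5.41) follow mechanically from the chosen poles by expanding the product of the first-order factors (1 - z_i z^{-1}). A minimal generic sketch of that expansion; the two real roots in the demonstration are illustrative, not the designed poles of (5.40):

```python
def char_poly(poles):
    """Expand D(w) = prod_i (1 - z_i * w), with w = z^{-1}; returns [d0, d1, ...]."""
    coeffs = [1.0]
    for z in poles:
        nxt = coeffs + [0.0]            # multiply current polynomial by (1 - z*w)
        for i in range(1, len(nxt)):
            nxt[i] -= z * coeffs[i - 1]
        coeffs = nxt
    return coeffs

# Illustrative real roots (assumed for the demo):
d = char_poly([0.5, 0.25])
print(d)  # [1.0, -0.75, 0.125]
```

Complex-conjugate pole pairs, as in (5.40), combine to give real coefficients in the same way.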
7. Using the first Diophantine Equation (5.26) and the facts about the degrees of the polynomial characteristic functions (5.37), it is feasible to write the Diophantine Equation (5.26) in polynomial form:

(p_0 + p_1 z^{-1} + p_2 z^{-2})(a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1}) K z^{-1} (a_0 + a_1 z^{-1} + a_2 z^{-2})
+ (q_0 + q_1 z^{-1}) C_P z^{-1} (b_0 + b_1 z^{-1} + b_2 z^{-2})
= d_0 + d_1 z^{-1} + d_2 z^{-2} + d_3 z^{-3} + d_4 z^{-4}.    (5.43)
8. The next step is to write the controller transfer function coefficients in terms of the plant transfer function coefficients. To achieve this, the Uncertain Coefficient Method is applied to the characteristic polynomial given in Equation (5.43). However, since the Uncertain Coefficient Method solves for the coefficients by equating equal powers of the polynomials, it is necessary to expand the polynomial system in Equation (5.43) as follows:

p_0 a_0 + (p_0 a_1 + p_1 a_0) z^{-1} + (p_0 a_2 + p_1 a_1 + p_2 a_0) z^{-2} + (p_2 a_1 + p_1 a_2) z^{-3} + p_2 a_2 z^{-4}
+ K q_0 a_0 z^{-1} + (K q_0 a_1 + K q_1 a_0) z^{-2} + (K q_1 a_1 + K q_0 a_2) z^{-3} + K q_1 a_2 z^{-4}
+ C_P q_0 b_0 z^{-1} + (C_P q_0 b_1 + C_P q_1 b_0) z^{-2} + (C_P q_1 b_1 + C_P q_0 b_2) z^{-3} + C_P q_1 b_2 z^{-4}
= d_0 + d_1 z^{-1} + d_2 z^{-2} + d_3 z^{-3} + d_4 z^{-4}.    (5.44)
9. At this stage, the Uncertain Coefficient Method produces a system of linear equations, which is in fact the solution to Equation (5.44). This system can be expressed either as a set of algebraic equations or in matrix form:

[ a_0    0      0      0                    0                 ] [p_0]   [d_0]
[ a_1    a_0    0      (K a_0 + C_P b_0)    0                 ] [p_1]   [d_1]
[ a_2    a_1    a_0    (K a_1 + C_P b_1)    (K a_0 + C_P b_0) ] [p_2] = [d_2]    (5.45)
[ 0      a_2    a_1    (K a_2 + C_P b_2)    (K a_1 + C_P b_1) ] [q_0]   [d_3]
[ 0      0      a_2    0                    (K a_2 + C_P b_2) ] [q_1]   [d_4]
p_0 a_0 = d_0,
(p_0 a_1 + p_1 a_0 + K q_0 a_0 + C_P q_0 b_0) = d_1,
(p_0 a_2 + p_1 a_1 + p_2 a_0 + K q_0 a_1 + K q_1 a_0 + C_P q_0 b_1 + C_P q_1 b_0) = d_2,
(p_2 a_1 + p_1 a_2 + K q_1 a_1 + K q_0 a_2 + C_P q_1 b_1 + C_P q_0 b_2) = d_3,
(p_2 a_2 + K q_1 a_2 + C_P q_1 b_2) = d_4.    (5.46)
Both of these representations of the set of equations provide an opportunity to express the unknown controller polynomial coefficients in terms of the estimated plant coefficients and the reference closed loop characteristic polynomial. The minor difference is that while the matrix form given in Equation (5.45) requires a matrix inversion, the expression in Equation (5.46) requires a series of algebraic operations.
10. As a result of the Uncertain Coefficient Method, the following set of expressions is obtained for the controller transfer function polynomial coefficients. Here, in order to express the controller polynomial coefficients in a more understandable form, the following constants are defined:

H_1 = (K a_2 + C_P b_2) - (K a_0 + C_P b_0)(a_2/a_0),
H_2 = (K a_1 + C_P b_1) - (K a_2 + C_P b_2)(a_1/a_2),
H_3 = (K a_1 + C_P b_1) - (K a_0 + C_P b_0)(a_1/a_0),
H_4 = (K a_0 + C_P b_0) - (K a_2 + C_P b_2)(a_0/a_2),
H_5 = d_2 - (d_0/a_0) a_2 - (a_1/a_0) d_1 + (d_0/a_0^2) a_1^2 - a_0 (d_4/a_2),
H_6 = d_3 - d_4 (a_1/a_2) - d_1 (a_2/a_0) + (d_0/a_0^2) a_2 a_1.    (5.47)
In fact, these constants in Equation (5.47) are used in the controller transfer function polynomial coefficients given below:

q_0 = (H_5 H_2 - H_4 H_6) / (H_3 H_2 - H_4 H_1),
q_1 = (H_6 - H_1 q_0) / H_2,
p_0 = d_0 / a_0,
p_1 = d_1/a_0 - d_0 a_1/a_0^2 - (K a_0 + C_P b_0)(q_0/a_0),
p_2 = d_4/a_2 - ((K a_2 + C_P b_2)/a_2) q_1.    (5.48)
Pole placement (assignment) addresses the drawbacks of the deadbeat method and gives flexibility to the designer in determining closed loop response characteristics such as damping ratio, overshoot, and settling time. This is a significant improvement, because in some systems it is hardly possible to control the closed loop response with high accuracy in a very short period of time due to controller or actuator limitations. In control terminology, this is identified as fast stabilization of the closed loop system, which leads to saturation in the actuator or the controller. In contrast, in the pole placement method it is always possible for the designer to decide on the degree of accuracy and the response time for particular application domains and specific systems. Hence, the pole placement algorithm is applicable to a wide range of applications. However, it is also inevitable that the complexity and computational cost of the pole placement method increase significantly from the designer's point of view, since closed loop pole design is involved.
Pole Placement Algorithm Complexity Analysis
For a pre-designed pole location, the pole placement controller design algorithm is, in fact, the solution of the linear matrix equation Ax = b given in Equation 5.45. Hence, the complexity of solving this linear matrix equation equals the complexity of the pole placement algorithm, which is Θ(n³) with the Gaussian elimination method and as high as Θ(nⁿ) with naive polynomial expansion methods. Due to this algorithmic complexity, the system order is very critical to the computational efficiency of the overall framework.
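A minimal sketch of that Θ(n³) solve: the matrix of Equation (5.45) is built from illustrative (assumed) plant values and solved by Gaussian elimination with partial pivoting.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting, Theta(n^3)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n                                  # back substitution
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Build the 5x5 system of (5.45) from illustrative (assumed) plant values:
a0, a1, a2 = 1.0, -0.5, 0.06
b0, b1, b2 = 0.2, 0.1, 0.05
K, CP = 1.0, 0.5
G0, G1, G2 = K*a0 + CP*b0, K*a1 + CP*b1, K*a2 + CP*b2
A = [[a0, 0.0, 0.0, 0.0, 0.0],
     [a1, a0,  0.0, G0,  0.0],
     [a2, a1,  a0,  G1,  G0 ],
     [0.0, a2, a1,  G2,  G1 ],
     [0.0, 0.0, a2, 0.0, G2 ]]
d = [1.0, -0.2596, -0.0201, -0.0131, 0.0043]       # coefficients of (5.41)
p0, p1, p2, q0, q1 = solve(A, d)

# Residual check: A x should reproduce d.
for row, di in zip(A, d):
    assert abs(sum(r*xi for r, xi in zip(row, (p0, p1, p2, q0, q1))) - di) < 1e-9
```

Note that the first row immediately gives p0 = d0/a0, in agreement with (5.48).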
5.3 Concluding Remarks
In this chapter, we have developed two alternative controller design methodologies with different complexity and applicability. Despite their differences, both approaches provide algebraic methods for finding the controller polynomial coefficients, or parameters, from given plant and system parameters. In other words, these algebraic design methods seek controller polynomial equations which drive the closed loop system toward the reference, or pre-set, ideal closed loop characteristic for a given plant characteristic polynomial at any given instant. In the context of our project, algebraic controller design is significant because our cache dynamics are highly unpredictable and show time-varying characteristics. Hence, it is necessary to estimate the plant polynomials at a given instant and to pass these plant
characteristics as polynomial coefficients to the algebraic control design methods. In this way, a specific controller is designed to keep the overall closed loop response at the ideal, pre-set characteristics. No matter what the cache performance is at a specific instant, the overall instruction count executed within that time will always be the same.
Chapter 6
Experimental Setup and Simulation
This chapter is an endeavor to validate the theoretical findings and derivations in line with our project aim and objectives. Namely, the question of whether the framework is able to meet the performance criteria and objectives is addressed in this chapter. To be more specific, it is necessary to clarify the research objectives.
1. Our main objective is to eliminate the processing inefficiency caused by shared cache resource allocation. In particular, the dependence of a thread's processing efficiency on the cache access patterns of its co-runners is targeted. Our framework is settled on the fact that cache misses on a shared cache memory are inevitable, because the shared cache is a limited memory unit shared among application threads with temporal cache memory demands. In such a cache environment, co-runner thread cache access behavior has a significant impact on a thread's performance. Hence, we propose that threads suffering high memory latency due to co-runner cache dependency can be compensated by allocating more processing cycles to them. In this way, we aim to improve completion time; that is, our main goal in this project is to minimize the performance degradation of a thread due to co-runner dependence on the shared resource by tracking a dedicated (reference) instruction cycle count. In this chapter, this argument is validated by comparing the instruction counts under a traditional scheduler and under our adaptive scheduler.
2. Our second objective is to provide a proactive framework rather than a reactive one. This capability requires prediction or estimation capability in the system. In this respect, our adaptive scheduling framework includes a parameter estimation module, which estimates the thread cache access pattern and thereby determines the proactive control action to minimize the instruction count error of the thread. In this chapter, this argument is validated by assessing the accuracy of the cache miss estimation.
6.1 Experiment Design
In order to provide a consistent performance assessment, it is necessary to develop test scenarios which provide consistent and comparable results. Ideal scenarios are able to reflect the actual performance of a particular algorithm, and their results should be comparable with those of other algorithms under similar test cases. In general, any simulation or test scenario is characterized by three types of constraints: controlled, fixed and free constraints.
The controlled constraints are parameters whose values are intentionally adjusted by the experimenter in order to reach a particular conclusion. In our experiment, the controlled parameter is the pair of co-runner threads or processes. In contrast to the controlled constraints, fixed constraints are those whose values are kept constant throughout the experiment. The main objective behind fixed constraints is to provide consistency across experiment results. As for free constraints, they are the ones which are not controllable by the designer and are generally assumed to have minor impact. In our experimental design strategy, two constraint types, fixed and controlled, are mainly used.
In our project context, controlled constraints aim to identify and observe different execution patterns or resource demands of threads or processes. In fact, each different execution pattern or resource demand corresponds to a particular adaptive scheduling scenario. Our adaptive scheduling framework is considered an efficient solution only if it can efficiently handle all these scenarios; in this case, controlled constraints provide a complete picture of the scheduling performance under different occurrences. Fixed constraints refer to the simulation platform settings such as cache memory size, number of cores, cache replacement policy, compiler type, and other computing platform settings. In our scenarios, the cache size, number of cores, cache replacement policies, and core and cache architecture are kept fixed. In addition, our experiments use compiled SPEC CPU2000 benchmark binaries; that is, the sample applications are compiled using the same specific instruction sets and converted into binary/hardware code. In this way, software platform dependence on performance is eliminated. All in all, fixed constraints provide consistency among experiments and allow only the controlled constraints to affect the result of the experiment. That is, the experiment designer keeps all constraints except the controlled ones constant in order to observe the impact of those particular constraints on the experiment.
Considering these experiment design elements, a number of experiments will be designed
to assess the performance characteristics of our scheduling framework. The following section
includes the development of experiments, and the rest of the chapter analyzes and evaluates the
experiment outputs.
6.2 Experiment Scenarios
6.2.1 Development of Experiment Constraints
In this section, the controlled and fixed constraints briefly mentioned in the previous section are discussed in detail in our project context.
Controlled Constraints
As mentioned previously, controlled constraints refer to the variant factors in the experiment. In fact, the efficiency of the framework is measured with respect to these constraints. In our experiment, co-runner pairs are considered the controlled constraints. Because each thread has a different memory and execution pattern, each co-runner thread pair exhibits a different system behavior; from the adaptive scheduling perspective, this implies a different identification complexity, a different control action and a different reference signal value. Hence, each controlled pair of co-runner threads validates one particular aspect of the system. In line with our project context, we create two sets of co-runner thread pairs according to cache resource demands:
1. Heavyweight Co-Runner Threads This pair refers to threads having a high-volume cache demand and showing non-temporal, streaming cache behavior. Streaming applications, such as video coding, belong to this class of threads. For instance, the SPEC CPU2000 integer benchmark mcf requires up to 1.7 GB of memory, and swim also requires a cache bigger than 16 MB. Hence, the swim-mcf pair can be considered a heavyweight co-runner pair. For heavyweight co-runner threads, the cache performance of these threads has a significant impact on overall thread performance. Hence, this class of co-runner threads is significant in our cache-aware adaptive scheduling experiments, and can even be considered an application domain for our cache-aware adaptive framework.
2. Lightweight Co-Runner Threads This pair refers to a pair composed of two threads with different cache requirements. While the first thread of the pair has a significant cache demand, the second does not have a significant cache resource demand and may show high temporal locality. As a result, the second thread's cache performance hardly impacts the first thread or the overall system performance.
To sum up, controlled constraints define the system performance under variant conditions on a particular platform, which enables a more reasonable analysis.
Fixed Constraints
Fixed constraints refer to the invariant factors throughout the experiment. In our experiment, multi-core architecture elements such as the cache replacement policy, the CPU architecture (such as simple scalar or superscalar), the cache memory size and many other architectural elements are fixed constraints. Although these factors might also have different impacts on the processor performance of different threads, these impacts are omitted because our research investigates the negative impact of co-runner-dependent cache resource allocation on thread processing efficiency, rather than architectural differences and their impacts on processor performance. Apart from the processor architecture, there might also be compiler-related factors involved in the experiment output. Since our experimental approach is an endeavor to provide a fair global comparison among existing scheduling approaches, it is inevitable to use a standard experimental setup. In our case, the SPEC CPU2000 benchmarking setup is preferred. It rules out discrepancies due to minor differences among compiler structures, software thread switching, and other software platform related differences. The SPEC CPU2000 benchmarking setup includes a set of
argument files and binary source code. Indeed, benchmarking is a controlled experiment strategy which allows the experimenter to make one-to-one comparisons with other experiment outputs.
All in all, our aim as experimenters is to provide a much clearer and simpler picture by fixing some constraints and considering only a few main ones. In addition, benchmarking strategies enhance our ability to analyze our experiment outputs and compare them with conventional algorithms.
6.2.2 Design Strategy: Two-Stage Experiment
In this section, we provide a series of different scenarios for our experiment setup. Once again, it is worth mentioning that benchmarking tools are utilized for the sake of consistency, better analysis and comparison with existing approaches and algorithms. Our experiment is a two-stage experiment, as shown in Figure 6.1; that is, statistics for a thread and its co-runner pair, such as miss count and instruction count, are retrieved using the M-SIM simulator and then fed into the MATLAB computing platform. Namely, while M-SIM is a multi-core multiprocessor simulator generating cache and instruction statistics, the adaptive control framework developed on the MATLAB computing platform is utilized to analyze the performance of the overall adaptive scheduling framework.
M-SIM: With an additional time-series statistic collector module, M-SIM generates time-series statistics of the cache miss and instruction count metrics of a thread and its co-runner threads.
Adaptive Self-Tuning Framework (MATLAB): For the collected time-series statistics, the adaptive self-tuning framework enforces a fair allocation of instruction count to a thread by adjusting the processor cycles of co-runner threads. The experiment outputs validate the efficiency of the overall framework, namely how accurate and robust the overall framework is for a given thread pair.
6.3 Experiments
In our research, considering the constraints and strategies referred to above, a series of experiments is conducted. These experiments aim to assess the performance of the cache aware adaptive closed loop scheduling framework while underlining the advantages and drawbacks of the framework under different sets of constraints. This section is outlined in three subsections: experiment setup, experiment outcomes and analysis.
6.3.1 Experiment Setup
In this experiment, the cache aware adaptive closed loop scheduling framework, which is composed of an adaptive self-tuning control framework and a multi-core multiprocessor scheduling framework, is considered the target scheduling framework. For a fixed set of multiprocessor constraints and controlled co-runner pairs, our experiment setup assesses the performance of the adaptive scheduling framework.
Firstly, the multi-core multiprocessor platform related constraints are specified as in Table 6.1, and for that particular platform configuration and the selected co-runner thread pair, time-series miss count and instruction count statistics are retrieved using the M-SIM simulation platform. Table 6.1 shows only a limited number of constraints, those which have a direct impact on the performance of the cache aware adaptive multiprocessor scheduling framework. The M-SIM simulation gives time-series statistics of the co-runner thread pair; more specifically, the instruction count vector (ic[k]) of the thread, the miss count vector (mc[k]) of the thread, the miss count vector (cmc[k]) of the co-runner pair, and the instruction count (ic_{Cache-Dedicated}[k]) of the thread on a dedicated cache platform, as shown in Figure 6.1.
As the collected time-series vectors are imported into the MATLAB workspace, MATLAB m-files are executed according to the logic given in Figure 6.2. For the imported time-series statistics, the main module mainSTRPP2.m calls the initdef.m function to initialize the system variables. After the initialization, the plant parameters are recursively identified using the QR-RLS module identnorder.m, and the corresponding controller parameters are designed using the pole placement module PolePP.m; these two modules are called by the main module recursively until the end of a simulation run. Meanwhile, the plant parameters and controller parameters are recursively updated and recorded. In this way, the system performance indicators are continuously monitored; they are analyzed in the following section, Experiment Outcomes. The modules mainSTRPP2.m, identnorder.m and PolePP.m in Figure 6.2 are implementations of the cache aware adaptive scheduling framework described in Chapter 3, the QR Recursive Least Squares online deterministic algorithm given in Chapter 4, and the algebraic pole placement method given in Chapter 5, respectively.
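The recursive identify-then-design loop of Figure 6.2 can be sketched as follows. This is a hypothetical Python stand-in, not the thesis code: a plain recursive least squares (RLS) update replaces the QR-RLS of identnorder.m, a synthetic noise-free first-order plant replaces the M-SIM statistics, and the controller design step (PolePP.m) is indicated only by a comment.

```python
def rls_step(theta, P, phi, y, lam=1.0):
    """One recursive least squares update for a 2-parameter model y = phi' theta."""
    Pphi = [P[0][0]*phi[0] + P[0][1]*phi[1],
            P[1][0]*phi[0] + P[1][1]*phi[1]]
    denom = lam + phi[0]*Pphi[0] + phi[1]*Pphi[1]
    K = [Pphi[0]/denom, Pphi[1]/denom]            # gain vector
    err = y - (phi[0]*theta[0] + phi[1]*theta[1]) # prediction error
    theta = [theta[0] + K[0]*err, theta[1] + K[1]*err]
    P = [[(P[0][0] - K[0]*Pphi[0])/lam, (P[0][1] - K[0]*Pphi[1])/lam],
         [(P[1][0] - K[1]*Pphi[0])/lam, (P[1][1] - K[1]*Pphi[1])/lam]]
    return theta, P

# Identify a first-order plant y[k] = a*y[k-1] + b*u[k-1] from noise-free data.
a_true, b_true = 0.5, 1.0
theta = [0.0, 0.0]                       # initial estimate of [a, b]
P = [[1000.0, 0.0], [0.0, 1000.0]]       # large initial covariance
y_prev, u_prev = 0.0, 1.0
for k in range(1, 200):
    y = a_true*y_prev + b_true*u_prev    # "plant" output (stand-in for M-SIM data)
    theta, P = rls_step(theta, P, [y_prev, u_prev], y)
    # ...a controller design step (cf. PolePP.m) would use theta here...
    y_prev, u_prev = y, 1.0 if (k % 3) else -1.0  # persistently exciting input
print(theta)  # converges near [0.5, 1.0]
```

The period-3 input keeps the regressors linearly independent; with a constant or purely alternating input, the steady-state regressors become collinear and the two parameters cannot be separated.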
Table 6.1: Multi-Core CMP Architecture Constraints

L2 Data Cache: Number of Sets 512; Block Size 64; Associativity 16; Replacement Policy LRU; L2 Penalty (L3 Latency + Physical Memory Latency) 30 + 60 = 90 cycles
L1 Data Cache: Number of Sets 512; Block Size 64; Associativity 4; Replacement Policy LRU
L1 Instruction Cache: Number of Sets 512; Block Size 64; Associativity 2; Replacement Policy LRU
SMT Multi-Core Processor: Number of Cores 4; Max Contexts (Threads) per Core 3; Superscalar Architecture (Instruction BW) 8 instructions/cycle
Figure 6.1: Experiment Setup
Figure 6.2: MATLAB Framework Simulation Flow
6.3.2 Experiment Outcomes
According to the experiment setup given in the previous section, for each of the SPEC CPU2000 benchmark thread pairs given in Table 6.3, the performance indicators of the simulation outcomes of the cache aware adaptive scheduling framework are derived.
The SPEC CPU2000 benchmarking suite offers a number of workloads, each of which has different memory access characteristics and execution patterns. For each chosen co-runner thread pair in Table 6.2, multi-core CMP and adaptive scheduling framework simulations are conducted. The simulation outcomes are summarized in Table 6.3 per co-runner thread pair.
In the experiment, we have defined two classes of co-runner thread pairs, heavy-weight and light-weight. As mentioned previously, while light-weight co-runner pairs demonstrate comparably less cache dependent performance degradation, each of the heavy-weight pairs shows a high cache requirement; as a result, cache dependent performance degradation is inevitable for these pairs. In our experiment, the first thread of each pair always refers to the thread with a high cache requirement, such as mcf, gcc, applu or mesa. The second thread is selected as a heavy-weight thread in one pair and a light-weight thread in the following pair. In this way, the performance variability of the adaptive scheduling is investigated for both light-weight and heavy-weight co-runner pairs. Table 6.2 describes the benchmark threads, or workloads, and their applications.
Our experiments reveal that for the selected co-runner pairs, it is possible to improve the average instruction count of the specified thread so that any cache memory related processing delay is compensated. The achieved improvement refers to how close the cache aware adaptive closed
Table 6.2: Workload/Thread Definition & Scope [KleinOsowski and Lilja, 2002]

Thread | Definition | Type/Component
art    | image recognition / neural network thread | Floating Point
mcf    | combinatorial optimization thread | Integer
gcc    | C compiler | Integer
applu  | parabolic/elliptic partial differential equation solver | Floating Point
mesa   | 3D graphics library thread | Floating Point
vpr    | FPGA circuit placement and routing thread | Integer
crafty | game playing (chess) thread | Integer
Table 6.3: Adaptive Scheduling Framework Experiment Results

Co-Runner Pair | Avg. Miss Count (Conventional) | Avg. IC (Conventional) | Avg. Estimated Miss Count | Avg. IC (Adaptive) | Avg. Dedicated IC
mcf & art     | 35    | 331  | 48    | 551  | 654.54
mcf & vpr     | 34    | 590  | 42    | 578  | 654.54
gcc & art     | 5     | 271  | 4.50  | 375  | 615.55
gcc & crafty  | 3.79  | 557  | 3.38  | 525  | 615.55
applu & art   | 6     | 721  | 5.082 | 1138 | 1196
applu & vpr   | 7     | 1137 | 6.04  | 1198 | 1196
mesa & applu  | 23.82 | 397  | 32    | 933  | 757.15
mesa & crafty | 12    | 704  | 12.61 | 1135 | 757.15
loop scheduling framework brings the instruction count in the shared cache environment (multiple threads) to the instruction count in the dedicated cache environment (single thread). According to Table 6.3, the framework ensures that the instruction count of a thread in a shared cache environment converges to its instruction count in the dedicated cache environment by allocating extra execution slots to the thread. However, as might be observed, the instruction count increase is not the same for all threads, and in some cases there is even a small degradation in the instruction count of the thread. In fact, the actual instruction count distribution with respect to time indicates the degree of persistency. The behavior of the reference instruction count and the actual instruction count with respect to time impacts the feasibility of the cache aware adaptive closed loop scheduling framework and the accuracy of the experiment outcomes. These issues are discussed in detail in the Analysis and Evaluation of Experiments section.
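As a back-of-envelope illustration of this convergence, the relative degradation with respect to the dedicated-cache instruction count can be computed from the Table 6.3 values of the mcf & art pair. The metric below is an illustrative summary of the table, not a statistic the thesis itself reports.

```python
def degradation(avg_ic, dedicated_ic):
    """Relative performance degradation w.r.t. the dedicated-cache IC."""
    return 1.0 - avg_ic / dedicated_ic

# mcf & art row of Table 6.3:
conv = degradation(331, 654.54)   # conventional scheduler
adap = degradation(551, 654.54)   # adaptive scheduler
print(round(conv, 3), round(adap, 3))  # 0.494 0.158
```

For this pair, the adaptive scheduler thus cuts the co-runner-induced degradation from roughly 49% to roughly 16% of the dedicated-cache instruction count.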
6.3.3 Analysis and Evaluation of Experiments
In this section, a detailed analysis of the experiment outcomes is conducted. Each benchmark result given in the previous section is supported with a time series analysis. The main objective is to determine the feasibility, practicality, weaknesses and strengths of the cache aware adaptive closed loop scheduling framework. In this respect, the instruction count and miss count time series regression statistics of the conventional scheduler and the adaptive scheduling framework are retrieved. Then, the instruction count statistics are compared with the reference instruction count statistics, which are gathered for the same thread under a dedicated cache environment.
mcf-art Co-Runner Pair
This co-runner pair is identified as a heavy-weight co-runner pair. Hence, as stated in Table 6.3, a high miss count and a considerable difference between the reference (dedicated cache) and actual (shared cache) instruction counts are anticipated. In such a scenario, we expect our adaptive framework to be able to improve the instruction count of a particular thread among the co-runner threads; in this case, mcf is the target thread of this co-runner pair. In Figure 6.3, the cyan dots represent the actual instruction count data points under the conventional scheduler, whereas the magenta line plot is the adaptive instruction count, which indicates the mcf computing performance under our adaptive scheduling framework. According to Figure 6.3, the instruction count performance is increased significantly over time. This can also be verified in Table 6.3.
Figure 6.3: Adaptive Instruction Count and Actual Instruction Count
Figure 6.4: Adaptive Instruction Count and Reference Instruction Count
As mentioned previously, our main goal in the cache aware adaptive closed loop scheduling framework is to drive the instruction count of the target thread towards the reference instruction count, considered as the thread's instruction count in a dedicated cache environment. Figure 6.4 indicates that the adaptive instruction count, the red line plot, converges towards the mean of the reference instruction count statistics, the cyan data points. This behavior can also be verified in the statistics given in Table 6.3.
Figure 6.5: Adaptive Miss Count and Actual Miss Count
Not only the instruction count error but also the cache miss count is considered for the reference instruction count tracking. The controller action for an instruction tracking error is reconstructed on each cycle based on the estimated cache miss statistics. It is worth mentioning once again the relation between cache misses and instruction count in our framework. In our framework, the instruction count tracking error is compensated with a number of processor execution cycles. In other words, the controller produces a controller output, the processor execution cycles corresponding to the instruction count tracking error. The instruction count equals the number of cycles divided by the cycles per instruction (CPI), where the CPI is the ideal CPI plus the stalls due to cache misses. Hence, in such a framework, the accuracy of the cache miss prediction is critical. Figure 6.5 illustrates the performance of this estimation process. According to Figure 6.5, despite spikes in the transient state of the thread execution period (approximately 150 cycles out of 1500), the estimation converges to the actual miss count distribution. Here, the cyan points refer to the actual miss counts while the red plot is the miss count estimation. In fact, except for 10 data points out of 1500, the estimation looks reasonable. As a result, we can conclude that the processor cycle to instruction count translation in our framework is accurate enough to reflect the actual system performance.
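The cycle-to-instruction translation described above can be sketched with a simplified model. All numbers below are illustrative assumptions (an ideal CPI of 0.5 and a 90-cycle miss penalty echoing the L2 penalty of Table 6.1), not measured values.

```python
def instructions_from_cycles(cycles, ideal_cpi, misses, miss_penalty):
    """Instructions completed in a cycle budget: stall cycles from cache
    misses are removed first, the rest execute at the ideal CPI."""
    useful_cycles = cycles - misses * miss_penalty
    return useful_cycles / ideal_cpi

# Illustrative: a 1000-cycle budget, ideal CPI of 0.5 (superscalar),
# 5 misses at a 90-cycle penalty (cf. the L2 penalty in Table 6.1).
ic = instructions_from_cycles(1000, 0.5, 5, 90)
print(ic)  # 1100.0
```

The sketch makes the controller's sensitivity visible: an error in the estimated miss count shifts the predicted instruction count by miss_penalty/ideal_cpi instructions per miss, which is why the estimation accuracy shown in Figure 6.5 matters.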
Figure 6.6: Reference Instruction Count and Actual Instruction Count
So far, we have discussed the accuracy of our framework for a given co-runner thread pair; however, it is also necessary to consider the feasibility of the cache aware adaptive closed loop scheduling framework for a given co-runner pair. In this case, the ratio of the reference instruction count to the actual instruction count can be an indicator. For the mcf-art pair, Figure 6.6 illustrates that the reference instruction count distribution, which refers to the instruction count statistics in a dedicated cache environment, is significantly larger than the actual instruction count. This, in fact, points out that there are significant cache misses and, as a result, cache stalls, which degrade the cycles per instruction and the instruction count. Indeed, this analysis provides an initial guess as to the applicability and feasibility of our framework for this co-runner pair.
mcf-vpr Co-Runner Pair
In this scenario, the co-runner of the target thread mcf is replaced with the light-weight thread vpr. This scenario underlines the performance limitations of the cache aware adaptive closed loop scheduling framework for light-weight co-runner pairs. To start with, Figure 6.7 illustrates the actual and reference instruction count time series distributions. In contrast to the heavy-weight co-runner instance given in Figure 6.6, the actual instruction count is very close to the reference instruction count. In other words, there is a significant limitation for our adaptive scheduling framework, since the instruction tracking error is already close to zero. Hence, the framework has a negligible control effort, and so no major contribution. This fact can be validated through Table 6.3.

Figure 6.7: Reference Instruction Count and Actual Instruction Count
Figure 6.8: Adaptive Instruction Count and Actual Instruction Count
To be more specific, Figure 6.16 illustrates that in fact adaptive instruction count is equal to
and even smaller than the actual instruction count. The reason the instruction count is below the
actual one is considered as the result of the estimation/approximation process in our framework,
which in fact takes the mean of the actual process. In Figure 6.16, adaptive instruction count
6.3. EXPERIMENTS 127
plot passes through the middle of the actual data points. Although our adaptive
scheduling framework drives the adaptive instruction count toward the reference instruction
count, as shown in Figure 6.9, the small tracking error between the actual and the
reference instruction count reduces the control effort of our cache aware adaptive closed
loop scheduling framework.
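The limitation described here, that a near-zero tracking error yields a near-zero control effort, can be illustrated with a minimal proportional sketch; the gain and the sample series below are hypothetical, not values from the thesis experiments:

```python
# Minimal sketch: control effort shrinks with the instruction tracking error.
# K_P is a hypothetical proportional gain mapping instruction-count error to
# extra processor cycles; the sample series are illustrative only.
K_P = 0.5

def control_effort(reference, actual):
    """Return per-interval extra-cycle allocations for a tracking error."""
    return [K_P * (r - a) for r, a in zip(reference, actual)]

# Heavy-weight co-runner: actual count lags the reference, so effort is large.
heavy = control_effort([100, 100, 100], [60, 55, 65])
# Light-weight co-runner (vpr-like): actual is already close to the reference,
# so the framework's contribution is negligible.
light = control_effort([100, 100, 100], [99, 98, 100])

assert sum(heavy) > sum(light)
```

For a heavy-weight co-runner the accumulated effort is large, while for a vpr-like light-weight co-runner it is negligible, matching the behavior observed in this scenario.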
Figure 6.9: Adaptive Instruction Count and Reference Instruction Count
As Figure 6.10 shows, the miss count estimation performance of the adaptive scheduling framework
is satisfactory.
Another significant observation about these two co-runner pairs and the target thread mcf is that,
although the cache miss counts of the target thread mcf in both the light-weight and heavy-weight
co-runner pairs are the same, the actual instruction counts differ. This indicates that in the
heavy-weight co-runner case both threads are cache demanding; in other words, the number
of cache slots allocated per thread is low. Hence, even though the cache miss count is the same as in the
light-weight co-runner pair, the instruction count is in fact limited by the size of the memory resources
allocated per thread. This case also underlines the need to complement additional processor cycle
allocation with extra cache memory blocks.
In the two cases discussed so far, the target thread instruction count and cache miss charac-
teristics have been illustrated for two different co-runner pairs, light-weight and heavy-weight. In
other words, target thread performance has been analyzed with respect to variant co-runner pairs:
light-weight and heavy-weight threads.
Figure 6.10: Adaptive Miss Count and Actual Miss Count
The same conclusions and characteristics apply to the other co-runner pairs given in Table 6.3.
However, the characteristics of the mesa thread look unusual irrespective of the co-runner pair.
To emphasize and investigate this, the instruction count and cache miss characteristics of mesa,
and of another target thread, gcc, are investigated for the fixed light-weight co-runner crafty.
mesa-crafty Co-Runner Pair
According to Table 6.2, mesa is a 3D library thread and crafty is a game-playing thread, particularly
for artificial intelligence games. To begin with, Figure 6.11 indicates that the actual instruction
count statistics are very close to the reference instruction count, since crafty is a
light-weight co-runner. However, both the actual and the reference instruction count
characteristics of mesa indicate a distinct execution pattern with idle and active
periods, as shown in Figure 6.11.
While a significant instruction count is observed in the active period, no activity, or a significant
drop in instruction count, is observed during the passive period. As a result of these
fast transitions between active and passive states, our algorithm fails to track the reference
instruction count and actually produces a significantly larger adaptive instruction count, as
shown in Figure 6.13. In summary, as a result of the rapid transitions between the passive state
instruction count and the active state instruction count of the thread mesa, as shown in Figure
6.12, the adaptive instruction count is aggressively increased by the framework. This leads
to an extra instruction allocation around 25% greater than the reference instruction count.
Figure 6.11: Reference Instruction Count and Actual Instruction Count
Figure 6.12: Adaptive Instruction Count and Actual Instruction Count
As for
Figure 6.13: Adaptive Instruction Count and Reference Instruction Count
the cache miss count estimation, Figure 6.14 illustrates a similar characteristic as in the previous
cases. Despite a few outlying points, the adaptive (estimated) miss count tracks the
actual miss count measurements.
In summary, due to the highly nonlinear instruction count pattern of the thread mesa, at some
vertical transition points, where the derivative approaches infinity, the cache aware adaptive closed
loop scheduling framework responds so aggressively that the adaptive instruction count jumps over
the reference instruction count. This highly nonlinear transition in instruction count can best be
explained, in the context of a 3D rendering application, as the transition from steady visual elements
to highly active visual elements and textures. For instance, rendering peak highway traffic
could be an intensive load for mesa, whereas rendering a static scene or view has a very
low computational resource requirement. Hence, this scenario indicates that there may be
cases where the framework has a tracking error of up to 25%, and underlines the fact that
framework performance also depends on the persistency, or character, of the input data, which in
our case is the reference instruction count.
Figure 6.14: Adaptive Miss Count and Actual Miss Count
gcc-crafty Co-Runner Pair
In contrast to mesa-crafty, gcc-crafty has a similar instruction count characteristic to the rest of
the light-weight co-runner pairs. Due to the light-weight co-runner crafty, the reference and actual
instruction count distributions of the target thread gcc are quite close
to each other, as shown in Figure 6.15. This indicates that the actual tracking error is close to
zero.
Because the instruction count error is close to zero, our adaptive scheduler's effort
to drive the actual instruction count to the reference instruction count is limited; as a result, the
adaptive instruction count is smaller than or equal to the actual instruction count, as shown in
Figure 6.16. This is also indicated by Table 6.3.
Although the cache aware adaptive closed loop scheduling framework tracks the
reference instruction count, as shown in Figure 6.17, in this instance the actual
instruction count of the target thread is so close to the reference instruction count that our
adaptive scheduling framework's effort is negligible. In other words, the instruction count error,
the difference between the reference and the actual instruction count, is close to zero; as a result,
the corresponding controller output, execution time slots (processor cycles), will be considerably
small. In such a circumstance, the cache aware adaptive closed loop scheduling framework's impact
is negligible. As mentioned previously in many instances, for a light-weight co-runner
pair our framework's influence on thread performance is limited.
Figure 6.15: Reference Instruction Count and Actual Instruction Count
Figure 6.16: Adaptive Instruction Count and Actual Instruction Count
Figure 6.17: Adaptive Instruction Count and Reference Instruction Count
As for the cache pattern (miss count) estimation, similar estimation capabilities are observed,
as illustrated in Figure 6.18. This confirms that the cache aware adaptive closed loop
scheduling framework not only tracks the instruction count with respect to the reference
but also takes cache statistics into account during this process. In a practical setting, the cache
miss pattern is considered because only processor cycles can be utilized as a controller output
by the processor; that is, for a calculated instruction count error, the corresponding processor
cycles are allocated through the thread's cache access pattern. This is mainly because
the processor itself is unaware of a thread's resource requirements until the thread is actually
executed. Thus, the controller output should be an entity independent of thread execution behavior.
6.4 Concluding Remarks
In summary, our cache aware adaptive closed loop scheduling framework can be considered as an
enforcement mechanism, which ensures that the target thread instruction count tracks the reference
instruction count, which in our case is the instruction count in a dedicated computing environment.
Figure 6.18: Adaptive Miss Count and Actual Miss Count
Apart from this tracking capability, the framework also considers the cache pattern of the thread
during this process, providing proactive rather than reactive control. For a calculated
tracking error, the controller calculates the number of processor cycles (the controller output)
required to drive this error to zero, taking the cache pattern of the target thread into account.
The conversion from instruction count tracking error to processor cycles is necessary because
the processor is only capable of allocating static resources, rather than dynamic ones
such as instruction counts, which depend on each thread's cache characteristics. Our adaptive
scheduling controller performs this mapping based on the estimated cache pattern of the target
thread; the processor then allocates the controller output (processor cycles) as requested by the
controller. Using this framework, a series of experiments was conducted following the experiment
design described in Section 6.1. These experiments indicate three different cases across the
variant co-runner thread pairs.
1. For heavy-weight co-runner thread pairs, our adaptive scheduling is successful in driving
the instruction count of the target thread toward the reference instruction count.
2. For a light-weight co-runner thread pair, our adaptive scheduling performance depends
on how close the actual instruction count is to the reference instruction count. In
this experiment, the reference instruction count is the target thread's instruction
count in a dedicated cache environment. In this case, our cache aware adaptive closed
loop scheduling framework can be considered a passive component in the processing
platform.
3. For a highly nonlinear and unstable target thread such as mesa, our scheduling framework
might have an error margin of up to 25 percent. This is entirely dependent on the
co-runner pair instruction count statistics; depending on the persistency and consistency
of the thread statistics, this error margin can vary from 1 percent to 25 or even 50 percent.
However, our experiments indicate that our cache aware adaptive closed loop scheduling
framework performs reasonably well for most of the threads.
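The conversion from instruction count tracking error to processor cycles, summarized above, can be sketched as follows; the instructions-per-cycle model and its constants are illustrative assumptions, not the identified model of the thesis:

```python
# Sketch of the tracking-error -> processor-cycle mapping. The processor can
# only grant static resources (cycles), so the instruction-count error must be
# converted using the thread's estimated cache pattern. The linear IPC model
# and its constants are illustrative assumptions, not thesis values.
BASE_IPC = 2.0          # instructions per cycle with no cache misses
MISS_PENALTY = 0.02     # hypothetical IPC loss per unit of estimated miss rate

def cycles_for_error(instr_error, est_miss_rate):
    """Map an instruction-count tracking error to extra processor cycles."""
    effective_ipc = max(BASE_IPC - MISS_PENALTY * est_miss_rate, 0.1)
    return instr_error / effective_ipc

# A cache-heavy thread (high estimated miss rate) needs more cycles to close
# the same instruction-count error than a cache-light one.
cache_heavy = cycles_for_error(1000, est_miss_rate=80)
cache_light = cycles_for_error(1000, est_miss_rate=5)
assert cache_heavy > cache_light
```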
As mentioned above, one of the significant contributions of this framework is in estimating
and accounting for thread cache patterns, which improves the robustness and practicability
of the framework. According to our experimental results, despite a small number of spikes and
jumps, cache pattern estimation is an effective component of the overall framework. All in
all, the experimental outcomes indicate that the cache aware adaptive closed loop scheduling
framework is a reasonable solution, which considers the cache patterns of the threads and
compensates for the performance degradation caused by inefficient cache allocation with an
additional allocation of processor cycles.
Chapter 7
Conclusions and Recommendations
7.1 Summary
This thesis has constructed a cache aware adaptive closed loop scheduling framework,
which merges multidisciplinary research areas, including modern control theory
and computer systems. In line with the objectives and contributions stated in the introduction,
two major and three minor achievements can be underscored:
• First of all, the execution fairness algorithm has successfully been applied to the cache
aware adaptive scheduling framework. As stated in the introduction, this
algorithm makes the actual thread instruction count converge to the cache dedicated
(reference) instruction count. In this way, the dependency of thread performance on the
co-runner has been eliminated.
• A thread execution model has successfully been developed. This state-space model
provides a time series formulation of the coupled instruction count and cache resource
dynamics of the thread and its co-runner threads. These sets of equations describe
the thread execution and cache access patterns.
• Based on the developed thread execution model, the QR Recursive Least Squares
(QR-RLS) parameter estimation algorithm has successfully been applied to estimate the
instantaneous cache pattern of the thread. To achieve this, a regression cache miss model
has been developed using the time series statistics of the miss count and co-runner miss
count metrics. The QR-RLS algorithm provides a set of polynomial coefficients representing
the cache behavior of the thread at that particular time for the given statistics.
• Using the estimated cache access and execution patterns of the thread, an algebraic
controller design algorithm (pole placement) has successfully been applied to the adaptive
self-tuning control framework, which forces the actual instruction count to track the cache
dedicated (reference) instruction count irrespective of the cache patterns of the co-runner
and the actual thread. In the pole placement algebraic design algorithm, the designer has
flexibility in determining the closed loop system response; in our case, a stable cache
aware adaptive scheduling framework response has successfully been designed.
• The first four achievements can be considered the theoretical foundation of the
adaptive self-tuning control framework design. The last achievement is the
development of a patch software module, which adds several functional modules and
time series hit/miss counters to the M-Sim system simulator.
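The estimation step in the third achievement can be sketched with a plain Recursive Least Squares update. The thesis uses the numerically more robust QR-factorized variant, and the regressor layout below (own and co-runner miss counts) is an assumption for illustration:

```python
import numpy as np

# Simplified Recursive Least Squares (RLS) estimator for the regression cache
# miss model. Each regressor stacks recent miss counts of the thread and its
# co-runner (an assumed layout); theta holds the polynomial coefficients of
# the cache pattern. The thesis applies the QR-factorized variant; this plain
# form is only a sketch.
def rls_update(theta, P, phi, y, lam=0.99):
    """One RLS step with forgetting factor lam; returns updated (theta, P)."""
    phi = phi.reshape(-1, 1)
    k = P @ phi / (lam + (phi.T @ P @ phi).item())       # gain vector
    theta = theta + k.ravel() * (y - (phi.T @ theta).item())
    P = (P - k @ phi.T @ P) / lam                        # covariance update
    return theta, P

rng = np.random.default_rng(0)
true_theta = np.array([0.6, 0.3])          # hypothetical "true" coefficients
theta, P = np.zeros(2), np.eye(2) * 100.0
for _ in range(200):
    phi = rng.uniform(0.0, 10.0, size=2)   # [own misses, co-runner misses]
    theta, P = rls_update(theta, P, phi, phi @ true_theta)
assert np.allclose(theta, true_theta, atol=1e-3)
```

With noiseless samples the estimate converges to the underlying coefficients within a few hundred updates; the QR-factorized form computes the same estimate with better numerical conditioning.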
Following the theoretical foundation of the cache aware adaptive closed loop scheduling
framework and the development of the statistics retrieval tools, a number of experiments have
been conducted and the results analyzed using SPEC CPU2000 benchmark threads. Based on
these observations, it is concluded that the framework is particularly effective for co-runner
threads with high cache requirements, and relatively ineffective for co-runners with low
cache demands.
7.2 Future Work and Recommendations
Some potential research areas can be identified on the basis of the literature and our research outcomes:
7.2.1 Heterogeneous Multiprocessor Architecture Resource Allocation Problem
In contrast to a homogeneous multiprocessor architecture, a heterogeneous multiprocessor architec-
ture has specific run-time fault handling, power management and computational resources
for each core. For instance, while one core may be a low-power microprocessor with
limited cache and processor resources, another can be a powerful graphics processing
core with extensive memory and processor resources. In such an architecture, two important
research questions arise:
• How can threads be allocated among these heterogeneous cores such that maximum
resource efficiency can be achieved?
• What kinds of strategies can be developed to estimate thread resource requirements at run
time?
Because the specialization and corresponding resources of each core are significantly
diversified, it is critical to allocate each thread to a suitable core; otherwise, significant
degradation in thread and core performance is inevitable. There are two main approaches
to the core/resource allocation problem: static and dynamic. In either case, an efficient
cache and resource management framework is essential. Resource and role management on
dynamic heterogeneous multicore processors (DHMPs) is thus a potential field for future
research.
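As a toy illustration of the static side of this allocation problem, a greedy heuristic can match estimated thread demands to heterogeneous core capacities; all demand and capacity figures below are hypothetical:

```python
# Toy greedy static allocation for heterogeneous cores. Cores expose different
# capacities (e.g. a low-power core vs. a GPU-like core); each thread carries a
# hypothetical estimated resource demand. Largest demands are placed first, on
# the least-loaded core that can still hold them.
def allocate(threads, cores):
    """threads: {name: demand}; cores: {name: capacity} -> {thread: core}."""
    load = {c: 0.0 for c in cores}
    placement = {}
    for name, demand in sorted(threads.items(), key=lambda kv: -kv[1]):
        feasible = [c for c in cores if load[c] + demand <= cores[c]]
        if not feasible:
            raise ValueError(f"no core can host {name}")
        best = min(feasible, key=lambda c: load[c])   # least-loaded feasible core
        load[best] += demand
        placement[name] = best
    return placement

cores = {"little": 2.0, "big": 8.0}             # hypothetical capacities
threads = {"mcf": 5.0, "vpr": 1.5, "gcc": 2.5}  # hypothetical demand estimates
plan = allocate(threads, cores)
assert plan["mcf"] == "big"
```

The second research question, estimating the per-thread demands at run time, is precisely what such a static heuristic cannot do on its own.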
7.2.2 Statistical Pre-processing of Real-Time Statistical Information
Another challenge in any adaptive or dynamic framework is the estimation of the system state
based on past measurements or states. Although statistical and deterministic estimation
methods are straightforward to apply, a significant number of measurement samples exhibit
highly nonlinear and unpredictable patterns. This generally has a negative impact on the
efficiency of the estimation process and, indirectly, on overall system performance.
Hence, pre-processing the statistics, before forming a conclusion about the system state,
is necessary to remove incoherent measurement samples. This elimination
process supports system stability and accuracy. In this field, methods range
from simple filters to highly complicated data-mining algorithms.
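A minimal instance of such pre-processing is a median-based filter that discards samples deviating strongly from the recent trend; the window size and threshold below are arbitrary illustrative choices:

```python
import statistics

# Sketch of statistical pre-processing: drop measurement samples that deviate
# from the median of a sliding window by more than `thresh` median absolute
# deviations (MAD). Window size and threshold are illustrative choices.
def filter_outliers(samples, window=5, thresh=4.0):
    kept = []
    for i, x in enumerate(samples):
        recent = samples[max(0, i - window):i] or [x]
        med = statistics.median(recent)
        mad = statistics.median(abs(r - med) for r in recent) or 1.0
        if abs(x - med) <= thresh * mad:
            kept.append(x)
    return kept

# A spike of 900 in an otherwise stable miss-count stream is removed.
stream = [10, 12, 11, 13, 900, 12, 11]
assert 900 not in filter_outliers(stream)
```

Even this crude filter keeps the stable samples intact while discarding the incoherent spike; data-mining approaches refine the same idea.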
7.2.3 Robust Adaptive Control Theory
Another potential research field relevant to our project is robust control
theory, which can be applied to increase the robustness of the system by defining error margins
(bounds). This analysis can be used to reduce system control and estimation cost, especially
for adaptive self-learning systems: within a given error range, the system is considered
to have a fixed response even if there are significant oscillations within that range. As a
result, the adaptive self-learning cost can be reduced significantly. However, the derivation of these
bounds requires substantial mathematical and system analysis. With these mathematical tools,
the computational cost of our cache aware adaptive closed loop scheduling framework could be
decreased.
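The error-bound idea can be illustrated with a deadband: inside the bound the controller holds its previous output and skips the adaptation step, which is where the self-learning cost saving comes from. The bound and gain values below are hypothetical:

```python
# Sketch of a robustness bound as a deadband. When the tracking error stays
# within +/- BOUND, the controller reuses its last output and skips the costly
# adaptation (estimation) step entirely. BOUND and GAIN are hypothetical.
BOUND = 5.0
GAIN = 0.5

def step(error, last_output):
    """Return (output, adapted) for one control interval."""
    if abs(error) <= BOUND:
        return last_output, False        # fixed response, no adaptation cost
    return GAIN * error, True

outputs, adaptations = [], 0
out = 0.0
for err in [1.0, -3.0, 40.0, 4.0, -2.0]:
    out, adapted = step(err, out)
    outputs.append(out)
    adaptations += adapted
assert adaptations == 1                  # only the 40.0 excursion adapts
```

Deriving a bound that guarantees stability inside the deadband is the mathematically demanding part noted above.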
7.2.4 Theoretical Analysis of the Scheduling Framework
As stated in Chapter 4 and Chapter 5, the algorithmic complexities of the QR-RLS
and pole placement controller design algorithms are Θ(n²) and Θ(n³), respectively. Based on
this fact, the overall framework's computational complexity is max(Θ(n³), Θ(n²))
= Θ(n³), where n refers to the system order, i.e. the number of past measurement samples used
in deriving the current system states. In this thesis, the theoretical analysis of the
overall framework and of the individual algorithms is not comprehensive, owing to the limited scope
and time frame of our Masters research project. A potential future work is a comprehensive
theoretical analysis of the proposed algorithms, algebraic pole placement and QR Recursive Least
Squares (QR-RLS), covering the complexity and stability of the algorithms as well
as of the overall scheduling framework.
QR-RLS Algorithm The Fast QR-RLS algorithm is a promising alternative, which would reduce
the complexity of the online learning (parameter identification) module from Θ(n²)
to Θ(n).
Pole Placement Algorithm The pole placement algorithm has a high computational complex-
ity compared to the QR-RLS algorithm; hence, it can be considered the computational
bottleneck of the overall framework. Further theoretical analysis
and alternative computational approaches are necessary to reduce its complexity.
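As a concrete instance of the pole placement step for the simplest possible case, consider an identified first-order model x(k+1) = a·x(k) + b·u(k) with state feedback u(k) = -K·x(k); the closed loop pole lies at a - bK, so K = (a - p)/b places it at a desired p. The plant values below are hypothetical; the general multi-parameter case requires solving a linear (Diophantine) system, which is where the Θ(n³) cost arises:

```python
# First-order pole placement sketch. For x(k+1) = a*x(k) + b*u(k) and
# feedback u(k) = -K*x(k), the closed loop is x(k+1) = (a - b*K)*x(k),
# so choosing K = (a - p)/b places the pole at p. Values are hypothetical.
a, b = 1.2, 0.5          # identified (unstable) plant parameters
p = 0.6                  # desired stable closed-loop pole (|p| < 1)

K = (a - p) / b          # pole placement gain
assert abs((a - b * K) - p) < 1e-12

# The closed loop now decays geometrically instead of diverging.
x = 1.0
for _ in range(20):
    x = (a - b * K) * x
assert abs(x) < 1e-3
```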
Literature Cited
Anderson, J. H. and Srinivasan, A. (1999). A new look at pfair priorities. Technical report, In
Submission.
Anderson, J. H. and Srinivasan, A. (2000). Early release fair scheduling. 12th Euromicro
Conference on Real-Time Systems, 0:35.
Anthes, G. H. (2000). Cache memory. Computerworld, 34(14):62.
Apolinario, J. A. J. and Miranda, M. D. (2009). QRD-RLS Adaptive Filtering. Springer
Science+Business Media, LLC.
Astrom, K. J. and Wittenmark, B. (1994). Adaptive Control. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.
Baruah, S., Gehrke, J., and Plaxton, C. (1995). Fast scheduling of periodic tasks on multiple
resources. In Proceedings of the 9th International Parallel Processing Symposium, pages
280–288.
Baruah, S. K., Cohen, N. K., Plaxton, C. G., and Varvel, D. A. (1996). Proportionate progress:
A notion of fairness in resource allocation. Algorithmica, 15:600–625.
Bertogna, M., Cirinei, M., and Lipari, G. (2009). Schedulability analysis of global scheduling
algorithms on multiprocessor platforms. IEEE Trans. Parallel Distrib. Syst., 20(4):553–566.
Bobal, V., Bohm, J., Fessl, J., and Machacek, J. (2005). Digital Self-tuning Controllers.
Springer-Verlag.
Bohlin, T. (2006). Practical Grey-box Process Identification: Theory and Applications
(Advances in Industrial Control). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Brinkschulte, U. and Pacher, M. (2008). A control theory approach to improve the real-time
capability of multi-threaded microprocessors. In ISORC ’08: Proceedings of the 2008 11th
IEEE Symposium on Object Oriented Real-Time Distributed Computing, pages 399–404,
Washington, DC, USA. IEEE Computer Society.
Camacho, E. F. and Bordons, C. (1999). Model Predictive Control. Springer-Verlag.
Microsoft Corporation (2003). MSDN Library. ASP.NET.
Cottet, F., Delacroix, J., Kaiser, C., and Mammeri, Z. (2002). Scheduling in Real-Time Systems.
John Wiley & Sons, Chichester.
Diniz, P. S. (2008). Adaptive Filtering Algorithms and Practical Implementation. Springer
Science+Business Media, LLC.
Ebrahimi, E., Lee, C. J., Mutlu, O., and Patt, Y. N. (2010). Fairness via source throttling:
a configurable and high-performance fairness substrate for multi-core memory systems. In
ASPLOS ’10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for
programming languages and operating systems, pages 335–346, New York, NY, USA. ACM.
Elliott, D. L. (2009). Bilinear Control Systems: Matrices in Action, volume 169 of Applied
Mathematical Sciences. Springer Science+Business Media.
Fedorova, A., Seltzer, M., and Smith, M. D. (2006). Cache-fair thread scheduling for multicore
processors. Technical report, Harvard University.
Green, M. and Limebeer, D. J. N. (1995). Linear robust control. Prentice-Hall, Inc., Upper
Saddle River, NJ, USA.
Intel (2006). Dual-Core Update to the Intel Itanium 2 Processor Reference Manual. Intel
Corporation.
Ioannou, P. and Fidan, B. (2006). Adaptive Control Tutorial, volume 11 of Advances in Design
and Control. Society for Industrial and Applied Mathematics.
Itzkovitz, A., Schuster, A., and Shalev, L. (1998). Thread migration and its applications in
distributed shared memory systems. Journal of Systems and Software, 42(1):71 – 87.
Jahre, M. and Natvig, L. (2009). A light-weight fairness mechanism for chip multiprocessor
memory systems. In CF ’09: Proceedings of the 6th ACM conference on Computing frontiers,
pages 1–10, New York, NY, USA. ACM.
Jain, R., Chiu, D.-M., and Hawe, W. (1998). A quantitative measure of fairness and
discrimination for resource allocation in shared computer systems. CoRR, cs.NI/9809099:1–
32.
Kent, A. and Williams, J. G. (1997). Encyclopedia of Computer Science and Technology:
Supplement 21, volume 36 of Encyclopedia of Computer Science and Technology. CRC
Press.
Kim, S., Chandra, D., and Solihin, Y. (2004). Fair cache sharing and partitioning in a chip
multiprocessor architecture. In PACT ’04: Proceedings of the 13th International Conference
on Parallel Architectures and Compilation Techniques, pages 111–122, Washington, DC,
USA. IEEE Computer Society.
KleinOsowski, A. J. and Lilja, D. J. (2002). Minnespec: A new spec benchmark workload for
simulation-based computer architecture research. IEEE Comput. Archit. Lett., 1:7–.
Kroft, D. (1981). Lockup-free instruction fetch/prefetch cache organization. In ISCA ’81:
Proceedings of the 8th annual symposium on Computer Architecture, pages 81–87, Los
Alamitos, CA, USA. IEEE Computer Society Press.
Ljung, L. (1998). System Identification: Theory for the User (2nd Edition). Prentice Hall PTR.
Mak, P., Blake, M. A., Jones, C. C., Strait, G. E., and Turgeon, P. R. (1997). Shared-cache
clusters in a system with a fully shared memory. IBM J. Res. Dev., 41(4-5):429–448.
Matick, R. E., Heller, T. J., and Ignatowski, M. (2001). Analytical analysis of finite cache
penalty and cycles per instruction of a multiprocessor memory hierarchy using miss rates and
queuing theory. IBM J. Res. Dev., 45(6):819–842.
Sun Microsystems (2006). OpenSPARC T1 Microarchitecture Specification. Sun Microsystems,
Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A., first edition.
Ogata, K. (1997). Modern control engineering (3rd ed.). Prentice-Hall, Inc., Upper Saddle
River, NJ, USA.
Olukotun, K. (2007). Chip Multiprocessor Architecture: Techniques to Improve Throughput
and Latency. Morgan and Claypool Publishers.
Paraskevopoulos, P. (2002). Modern Control Engineering. Control Series. Marcel Dekker, Inc.,
270 Madison Avenue, New York, NY 10016.
Pfenning, F. and Barbic, J. (2007). Multi-core architecture.
Siddha, S., Pallipadi, V., and Mallick, A. (2007). Process scheduling challenges in the era of
multi-core processors. Intel Technology Journal, 11(4):361–370.
Silicon Graphics (2000). Origin2000 and Onyx2 Performance Tuning and Optimization
Guide. HTML.
Srikantaiah, S., Kandemir, M., and Wang, Q. (2009). Sharp control: controlled shared cache
management in chip multiprocessors. In MICRO 42: Proceedings of the 42nd Annual
IEEE/ACM International Symposium on Microarchitecture, pages 517–528, New York, NY,
USA. ACM.
Stacpoole, R. and Jamil, T. (2000). Cache memories. IEEE Potentials, 19(2):24–29.
Stallings, W. (2003). Computer Organization and Architecture: Designing for Performance
(7th Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Suh, G. E., Devadas, S., and Rudolph, L. (2001). Analytical cache models with applications
to cache partitioning. In ICS ’01: Proceedings of the 15th international conference on
Supercomputing, pages 1–12, New York, NY, USA. ACM.
Tam, D. K., Azimi, R., Soares, L. B., and Stumm, M. (2009). Rapidmrc: approximating l2 miss
rate curves on commodity systems for online optimizations. In ASPLOS ’09: Proceeding of
the 14th international conference on Architectural support for programming languages and
operating systems, pages 121–132, New York, NY, USA. ACM.
Tanenbaum, A. S. (2005). Structured Computer Organization (5th Edition). Prentice Hall, 5
edition.
Thimmannagari, C. (2008). Cpu Design: Answers to Frequently Asked Questions. Springer
Publishing Company, Incorporated.
Velusamy, S., Sankaranarayanan, K., Parikh, D., Abdelzaher, T., and Skadron, K. (2002).
Adaptive cache decay using formal feedback control. In In Proceedings of the 2002 Workshop
on Memory Performance Issues.
Vetter, S., Filhol, B., Kim, S., Linzmeier, G., and Plachy, O. (2006). IBM System p5 quad-core
module based on POWER5+ technology. Technical overview and introduction, IBM Corporation,
International Technical Support Organization, Dept. JN9B Building 905, 11501
Burnet Road, Austin, Texas 78758-3493, U.S.A.
Villani, P. (2001). Programming Win32 under the API. CMP Books, CMP Media, Inc.,
Publishers Group West, 1700 Fourth Street, Berkley, CA 94710.
Zhou, B., Qiao, J., and Lin, S. (2009a). Research on synthesis parameter real-time scheduling
algorithm on multi-core architecture. In CCDC’09: Proceedings of the 21st annual
international conference on Chinese control and decision conference, pages 5152–5156,
Piscataway, NJ, USA. IEEE Press.
Zhou, X., Chen, W., and Zheng, W. (2009b). Cache sharing management for performance
fairness in chip multiprocessors. In PACT ’09: Proceedings of the 2009 18th
International Conference on Parallel Architectures and Compilation Techniques, pages 384–
393, Washington, DC, USA. IEEE Computer Society.