TEL-AVIV UNIVERSITY
RAYMOND AND BEVERLY SACKLER
FACULTY OF EXACT SCIENCES
SCHOOL OF COMPUTER SCIENCE
Joint Cache Partition and Job Assignment on Multi-Core Processors
This thesis is submitted in partial fulfillment of the requirements for the M.Sc. degree
in the School of Computer Science, Tel-Aviv University
by
Omry Tuval
The research work for this thesis has been carried out at Tel-Aviv University
under the supervision of Prof. Haim Kaplan
January 2013
Acknowledgements
I would like to thank my advisor Prof. Haim Kaplan for the countless hours of invaluable
discussions and work that went into this research and thesis. I am honored to have worked with
him and thank him greatly for all the encouragement along the journey. It has been a wonderful
experience.
I would like to thank Dr. Avinatan Hassidim for numerous ideas and amazing intuitions.
Without his help this thesis would not have come to be. Above all, I thank him for being a true
friend.
Finally, I would like to thank my mother, Myriam Tuval, who supported and encouraged
me in taking time off to do this research. I strive and hope to be half the person that she is.
Abstract
Multicore shared cache processors pose a challenge for designers of embedded systems who try
to achieve minimal and predictable execution time of workloads consisting of several jobs. One
way in which this challenge was addressed is by statically partitioning the cache among the
cores and assigning the jobs to the cores with the goal of minimizing the makespan. Several
heuristic algorithms have been proposed that jointly decide how to partition the cache among
the cores and how to assign the jobs. We initiate a theoretical study of this problem which we
call the joint cache partition and job assignment problem.
In this problem the input is a set of jobs, where each job is specified by a function that
gives the running time of the job for each possible cache allocation. The goal is to statically
partition a cache of size K among c cores and assign each job to a core such that the makespan
is minimized.
By a careful analysis of the space of possible cache partitions we obtain a constant ap-
proximation algorithm for this problem. We give better approximation algorithms for a few
important special cases. We also provide lower and upper bounds on the improvement that can
be obtained by allowing dynamic cache partitions and dynamic job assignments.
We show that our joint cache partition and job assignment problem generalizes an interesting
special case of the problem of scheduling on unrelated machines that is still more general than
scheduling on related machines. In this special case the machines are ordered by their "strength"
and the running time of each job decreases when it is scheduled on a stronger machine. We call
this problem the ordered unrelated machines scheduling problem. We give a polynomial time
algorithm for scheduling on ordered unrelated machines for instances where each job has only
two possible load values and the sets of load values for all jobs are of constant size.
Contents

Abstract

1 Introduction
1.1 Our Contribution
1.2 Related Work
1.2.1 Motivating Practical Work
1.2.2 Machine Scheduling Theory
1.2.3 Single Core Caching
1.2.4 Multi Core Caching

2 The ordered unrelated machines problem

3 Joint cache partition and job assignment
3.1 A constant approximation algorithm
3.2 Single load and minimal cache demand
3.2.1 2-approximation
3.2.2 3/2-approximation with 2K cache
3.2.3 4/3-approximation with 3K cache, using dominant matching
3.2.4 Approximate optimization algorithms for the single load, minimal cache model
3.2.5 Dominant perfect matching in threshold graphs
3.2.6 PTAS for jobs with correlative single load and minimal cache demand
3.3 Step functions with a constant number of load types
3.3.1 The corresponding special case of ordered unrelated machines
3.4 Joint dynamic cache partition and job scheduling

4 Static partitions under bijective analysis

5 Cache partitions in the speed-aware model
5.1 Finding the optimal static partition
5.2 Variable cache partitioning

Bibliography
Chapter 1
Introduction
We study the problem of assigning n jobs to c cores on a multi-core processor, and simultaneously
partitioning a shared cache of size K among the cores. Each job j is given by a non-increasing
function Tj(x) indicating the running time of job j on a core with cache of size x. A solution is a
cache partition p, assigning p(i) cache to each core i, and a job assignment S assigning each job
j to core S(j). The total cache allocated to the cores in the solution is K, that is
∑_{i=1}^{c} p(i) = K.
The makespan of a cache partition p and a job assignment S is max_i ∑_{j : S(j)=i} Tj(p(i)). Our
goal is to find a cache partition and a job assignment that minimize the makespan.
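To make the objective concrete, here is a small sketch (with illustrative load functions and an illustrative instance, not taken from the thesis) that evaluates the makespan of a given cache partition and job assignment:

```python
def makespan(T, p, S):
    """Makespan of cache partition p and job assignment S.

    T: list of load functions, T[j](x) = running time of job j with x cache
    p: list of cache allocations, p[i] = cache given to core i (sums to K)
    S: list mapping each job j to the core S[j] it is assigned to
    """
    load = [0] * len(p)
    for j, core in enumerate(S):
        load[core] += T[j](p[core])
    return max(load)

# Illustrative instance: 3 jobs, 2 cores, cache K = 4 split as p = (1, 3).
T = [lambda x: 6 - x,        # non-increasing in the cache size x
     lambda x: 4,            # cache-insensitive job
     lambda x: 8 / (1 + x)]
p = [1, 3]
S = [1, 0, 1]                # jobs 0 and 2 share core 1; job 1 runs on core 0
# makespan: core 0 carries 4, core 1 carries (6 - 3) + 8/4 = 5
```

The whole difficulty of the problem lies in choosing p and S; evaluating a candidate solution, as above, is trivial.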
Multi-core processors are the prevalent computational architecture used today in PCs, mobile
devices and high performance computing. Having multiple cores running jobs concurrently,
while sharing the same level 2 and/or level 3 cache, results in complex interactions between the
jobs, thereby posing a significant challenge in determining the makespan of a set of jobs. Cache
partitioning has emerged as a technique to increase run time predictability and increase perfor-
mance on multi-core processors [LLD+08, MCHvE06]. Theoretic research on online multi-core
caching shows that the cache partition (which may be dynamic) has more influence on the per-
formance than the eviction policy [Has10, LOS12]. To obtain effective cache partitions, methods
have been developed to estimate the running time of jobs as a function of allocated cache, that
is the functions Tj(x) (see for example the cache locking technique of [LZLX10]).
Recent empirical research [LZLX11, LLX12] suggests that jointly solving for the cache par-
tition and for the job assignment leads to significant improvements over combining separate
algorithms for the two problems. The papers [LZLX11, LLX12] suggest and test heuristic al-
gorithms for the joint cache partition and job assignment problem. Our work initiates the
theoretic study of this problem.
We study this problem in the context of multi-core caching, but our formulation and results
are applicable in a more general setting, where the running time of a job depends on the
availability of some shared resource (cache, CPU, RAM, budget, etc.) that is allocated to the
machines. This setting is applicable, for example, for users of a public cloud infrastructure like
Amazon’s Elastic Cloud. When a user decides on her public cloud setup, there is usually a
limited resource (e.g. budget) that can be spent on different machines in the cloud. The more
budget is spent on a machine, the faster it runs jobs, and the user is interested in minimizing
the makespan of her set of jobs while staying within the given budget.
We also study the problem of cache partitioning for multi-core caching in more classical
settings, where each core is already assigned a specific job to run, and our algorithms only
determine the cache partition. In Chapter 4 we consider the model in which each core has
a sequence of page requests to serve. We assume that an algorithm consists of a static cache
partition and an eviction policy and the goal is to find an algorithm that minimizes the maximal
number of cache misses by any core. Assuming all inputs are possible and using bijective analysis
we characterize the optimal algorithm.
In Chapter 5 we study the speed aware multi-core caching problem in which each core is
specified by a speed function vi(x) indicating the speed of core i if it is allocated x cache pages
and we want to partition the cache such that the speed of the slowest core is maximized. We give
algorithms that find the optimal static cache partition and the optimal variable cache partition.
1.1 Our Contribution
We show that the joint cache partition and job assignment problem is related to an interesting
special case of scheduling on unrelated machines that we call the Ordered Unrelated Machines
scheduling problem. In this problem there is a total order on the machines which captures
their relative strength. Each job has a different running time on each machine and these
running times are non-increasing with the strength of the machine. In Chapter 2 we define this
scheduling problem and show a general reduction and an approximation preserving reduction
from scheduling on ordered unrelated machines to the joint cache partition and job assignment
problem.
We present a 36-approximation algorithm for the joint cache partition and job assignment
problem in Section 3.1. We obtain this algorithm by showing that it suffices to consider a subset
of polynomial size of the cache partitions.
We obtain better approximation guarantees for special cases of the joint cache partition
and job assignment problem. When each job has a fixed running time aj and a minimal
cache demand xj, we present, in Section 3.2, a 2-approximation algorithm, a 3/2-approximation
algorithm that uses 2K cache and a 4/3-approximation algorithm that uses 3K cache. We call this
special case the single load minimal cache demand problem. Our 4/3-approximation algorithm
is based on an algorithm that finds a dominant perfect matching in a threshold graph that
has a perfect matching, presented in Section 3.2.5. This algorithm and the existence of such a
matching in such a threshold graph are of independent interest.
We present a polynomial time approximation scheme for the single load minimal cache
demand problem, in the case where the jobs’ loads and cache demands are correlative, that is
aj ≤ aj′ iff xj ≤ xj′ (Section 3.2.6).
We study, in Section 3.3, the case where the load functions of the jobs, Tj(x), are step
functions. That is, job j takes lj time to run if given at least xj cache, and otherwise it takes
hj > lj time. We show that if there is a constant number of different lj's and hj's then the problem
can be reduced to the single load minimal cache demand problem and thereby obtain the same
approximation results as for that problem (Section 3.3). We further show that if we consider
the special case of scheduling on ordered unrelated machines that corresponds to this model,
then there is a dynamic programming algorithm that finds the optimal schedule in polynomial
time (Section 3.3.1).
In Section 3.4 we generalize the joint cache partition and job assignment problem and
consider dynamic cache partitions and dynamic job schedules. We show upper and lower bounds
on the makespan improvement that can be gained by using dynamic partitions and dynamic
assignments.
In Chapter 4 we use bijective analysis to prove that if the cores are already assigned specific
jobs and if the cores request pages from disjoint sets of the same size, then any algorithm that
partitions the cache equally among the cores is at least as good as any algorithm that uses
another static partition, regardless of the eviction policies used by the algorithms. Our proof is
based on showing a majorization theorem for the cumulative distribution function of a random
variable that is the maximum of several binomial random variables. We also present some
observations on the problem when cores request pages from sets of different sizes.
For the speed aware model, our main contribution (Chapter 5) is defining a variable cache
partition and presenting a linear program, of exponential size, for finding the optimal variable
cache partition. We then provide a separation oracle [GLS81] for this linear program and thus
obtain a polynomial time algorithm for finding the optimal variable cache partition using the
ellipsoid method [Kha79].
1.2 Related Work
1.2.1 Motivating Practical Work
Cache locking emerged in the real-time embedded systems community as a technique to reduce
the unpredictability that paging introduces into the running time of computational jobs.
The term cache locking refers to a technique in which a job is analyzed in advance to select
instructions and data to lock in the cache in order to minimize the job’s worst case execution
time. Falk et al [FPT07] describe a heuristic greedy algorithm that picks a subset of the
functions to lock in the cache. As long as some function can fit in the remaining free cache,
it picks the function that, if locked, would generate the largest decrease in execution time and marks
it for locking. For a comparative survey of instruction cache locking heuristics, see [CPIM05].
Vera et al present an algorithm for data cache locking [VLX03] that statically analyzes data
dependencies in the program to decide what data to lock and heuristically locks data that is
likely to be used, to augment the static analysis. Liu et al [LX09] formalize the problem of cache
locking and prove that it is NP-hard by a reduction from set cover. The paper also presents
optimal polynomial algorithms under the assumption that each function in the job’s code is
used only once, and shows that the previously best known heuristic ([FPT07]) for the cache locking
problem performs experimentally similarly to their optimal algorithm for this special case.
The joint cache partition and job assignment problem is defined and studied empirically
by Liu et al in [LZLX11]. Cache-locking techniques are used in order to estimate the jobs’
execution time for each cache allocation. They use this mapping from cache allocation to
running time as the input for the joint cache partition and job assignment problem. The
main observation of this paper is that the job assignment and the cache partition influence
each other and therefore should be solved simultaneously. Their heuristic algorithm starts by
allocating the same amount of cache to each core. It assigns the jobs to the cores, using
Graham's algorithm ([Gra69]) for scheduling on identical machines, which is a 4/3-approximation
algorithm. Graham's algorithm assigns the jobs, in a non-increasing order of their load, to the
currently least loaded core. Given the resulting job assignment, the algorithm computes the
optimal cache partition using a simple greedy algorithm (see Section 5.1). They further adjust
their solution by trying to move the smallest job currently on the most loaded core to the least
loaded core and recompute the optimal cache partition. If this improves the makespan, the
change is accepted and the adjustment process is repeated. This is performed until the
adjustment is no longer beneficial or until a set number of iterations is reached.
provides an empirical study of the performance of this technique. The work described in this
thesis initiates a theoretical study of the joint cache partition and job assignment problem.
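As a rough illustration (not the authors' implementation), the heuristic of [LZLX11] described above can be sketched as follows; the repartition step is rendered here as a simple greedy that repeatedly gives a cache page to the currently most loaded core, which is one plausible reading of the greedy algorithm referred to in Section 5.1:

```python
def lpt_assign(loads, c):
    """Graham's algorithm: jobs in non-increasing load order, each to the
    currently least loaded core."""
    assign = [0] * len(loads)
    core_load = [0.0] * c
    for j in sorted(range(len(loads)), key=lambda j: -loads[j]):
        i = min(range(c), key=lambda i: core_load[i])
        assign[j] = i
        core_load[i] += loads[j]
    return assign

def greedy_partition(T, assign, c, K):
    """Hand out the K cache pages one at a time, always to the currently
    most loaded core (loads are non-increasing in the cache size)."""
    p = [0] * c
    def load(i):
        return sum(T[j](p[i]) for j in range(len(T)) if assign[j] == i)
    for _ in range(K):
        p[max(range(c), key=load)] += 1
    return p

def joint_heuristic(T, c, K):
    # start from an equal cache split, assign with LPT, then repartition
    loads = [T[j](K // c) for j in range(len(T))]
    assign = lpt_assign(loads, c)
    return greedy_partition(T, assign, c, K), assign
```

The adjustment loop described above (moving the smallest job off the most loaded core, repartitioning, and keeping the change only if the makespan improves) would be layered on top of this.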
1.2.2 Machine Scheduling Theory
The joint cache partition and job assignment problem generalizes well known machine scheduling
problems where the objective function is the makespan.
In the problem of scheduling on unrelated machines, for each job j and each machine i
we are given T (i, j) which is the running time of job j on machine i. We want to find an
assignment of the jobs to the machines that minimizes the makespan. Lenstra et al. [LST90]
gave a 2-approximation algorithm for scheduling on unrelated machines that is based on a 2-
approximation algorithm for the decision problem. It formulates the decision problem as a linear
program, finds a vertex v of the feasible region if the problem is feasible, and then rounds it
to an integral solution with a makespan that is at most twice that of the fractional assignment
defined by v. The rounding is based on the fact that any vertex of the feasible region has at
most m + |J| non-zero variables and therefore at most m jobs that are not integrally assigned.
Consider any vertex of the linear program and consider a bipartite graph over the jobs and the
machines such that there is an edge between job j and machine i if a fraction of job j is assigned
to machine i, according to v. Lenstra et al. show how to find, in this graph and in polynomial
time, a matching that covers all the jobs. This matching defines an integral assignment of
makespan at most twice the makespan of the fractional solution defined by v. Lenstra et al. also
show that it is NP-hard to approximate the unrelated machines scheduling problem to within
a factor better than 3/2. This is based on showing that deciding whether an instance in which
T(i, j) ∈ {1, 2}, for all i, j, has an assignment of makespan at most 2 is NP-complete, by
a reduction from 3-dimensional matching. Shchepin and Vakhania [SV05] improved Lenstra's
rounding technique and obtained a (2 − 1/m)-approximation algorithm for unrelated machines.
The problem of scheduling on uniformly related machines is a special case of the unrelated
machines scheduling problem. In this problem, the input consists of a load lj for each job j and
a speed si for each machine i, and we assume that machine i runs job j in time lj/si. The goal
is again to assign the jobs to the machines such that the makespan is minimized. Hochbaum
and Shmoys [HS88] present a polynomial time approximation scheme for related machines, by
reducing the decision problem of scheduling on related machines to a bin packing problem with
bins of variable size. They give a (1 + ε)-approximation algorithm for this bin packing problem.
They also give a more practical 3/2-approximation algorithm.
Scheduling with processing set restrictions is a family of scheduling problems, where each
job j is allowed to be assigned to a given subset of the machines, denoted by Mj . Epstein and
Levin [EL11] show polynomial time approximation schemes for the following special cases of
scheduling with processing set restrictions:
• The nested model, where for any two jobs j, j′ such that |Mj| ≤ |Mj′|, either Mj ∩ Mj′ = ∅
or Mj ⊂ Mj′, and each job j has an identical load on all the machines in Mj.
• The tree hierarchical model, where the machines are vertices of a given rooted tree and, for
any job j, Mj is defined by a path starting from the root (not necessarily ending at a
leaf), and each job j has an identical load on all the machines in Mj.
• The speed hierarchical model, in which we have an instance of the related machines
scheduling problem and each job j has a minimal speed requirement. A job can be
assigned only to machines that meet its minimal speed requirement.
Their PTASs for these problems are based on solving a rounded down instance using dynamic
programming. Our polynomial time algorithm for the special case of ordered unrelated ma-
chines, in Section 3.3.1, is inspired by these PTASs.
Bonifaci and Wiese [BW12] consider a special case of unrelated machines in which there is
a constant number of different machine types. Each job may have different loads on machines
of different types but it has the same load for all machines of the same type. They present a
polynomial time approximation scheme for the problem. Their algorithm is based on classifying
jobs as large or small for each machine type. By rounding the loads of the large jobs they
are able to enumerate, in polynomial time, over the possible sets of preallocated slots for large jobs on the
machines, without deciding the identity of the jobs in the slots. They assign the large jobs to
the slots and the small jobs to the remaining space by iteratively solving a series of related
linear programs.
Ebenlendr et al. [EKS08] gave a 7/4-approximation algorithm for the special case of scheduling
with processing set restrictions where each job can be assigned to at most two machines, and
it has the same load on both. They reformulate this problem as the following graph balancing
problem. Consider an undirected graph where the vertices correspond to the machines, and
the weight of each vertex is the sum of the loads of all jobs that must run on the machine
corresponding to this vertex. Each job j that can run on two machines corresponds to an edge
between the two corresponding vertices and its weight is the load of job j. Given an orientation
of the edges, they define the cost of a vertex v to be the sum of v’s weight and the weights of
all edges directed toward v. The goal is to direct the edges such that the maximum cost of any
vertex is minimized. They obtain the 7/4-approximation algorithm by rounding a linear program
similar to Lenstra's, but with additional constraints that make sure that any vertex has at most
one edge of weight greater than 1/2 oriented toward it. Finally, they show that it is NP-hard
to achieve an approximation ratio better than 3/2 for graph balancing.
1.2.3 Single Core Caching
In the classical paging problem we serve a single sequence of page requests. If a requested
page is in the cache when it is requested, it is a cache hit and otherwise it is a cache miss. In
case of a cache miss the requested page is fetched into the cache and we need to decide which
page in the cache to evict. The goal is to serve the sequence with a minimum number of cache
misses. Furthest-in-the-future [Bel66] is an optimal offline algorithm for the paging problem
that upon any cache miss, evicts the page in the cache whose next request is furthest in the
request sequence.
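Furthest-in-the-future admits a direct implementation (quadratic, for exposition only):

```python
def furthest_in_future(requests, K):
    """Count the misses of Belady's offline-optimal policy on one sequence."""
    cache, misses = set(), 0
    for t, page in enumerate(requests):
        if page in cache:
            continue                      # cache hit
        misses += 1
        if len(cache) >= K:
            # evict the cached page whose next request is furthest away
            future = requests[t + 1:]
            def next_use(q):
                return future.index(q) if q in future else float('inf')
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses
```

For example, on the sequence 1, 2, 1, 3, 1, 2 with K = 2, the policy evicts page 2 when page 3 arrives (page 1 is requested sooner), giving the minimum possible four misses.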
Most of the theoretic work on paging considers the online problem, where the request se-
quence is not given in advance and the algorithms’ performance is studied under the framework
of competitive analysis [ST85]. In competitive analysis, the performance of an online algorithm
is compared to the performance of the optimal offline algorithm. The competitive ratio of an
online algorithm is the maximal ratio between the cost of the online algorithm and the cost of
the offline optimal algorithm, for any input. That is, the competitive ratio of algorithm A is
maxσA(σ)
OPT (σ) where A(σ) is the number of misses by A on input σ, and OPT (σ) is the smallest
possible number of misses of any (offline) algorithm on σ.
In the same paper, Sleator and Tarjan [ST85] show that any deterministic online paging
algorithm has a competitive ratio of Ω(K) where K is the cache size. This lower bound follows
since an adversary can cause any deterministic online algorithm to have a cache miss on every
page request, while Furthest-in-the-future has at most one cache miss for every K page requests.
They further prove that the commonly used eviction policies Least-Recently-Used (LRU) and
First-In-First-Out (FIFO) are K-competitive. LRU is a special case of a wider
class of Marking algorithms that are all K-competitive [KMRS88]. The K-competitiveness of
these algorithms is proved by splitting the page requests sequence into phases such that the
optimal offline algorithm has at least one cache miss in each phase and any marking algorithm
has at most K cache misses in each phase. For more on competitive analysis of deterministic
and randomized online paging algorithms, see [Ira96].
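The Ω(K) lower bound is easy to observe concretely: if an adversary cycles through K + 1 distinct pages, LRU evicts exactly the page that is requested next and therefore misses on every single request. A small illustrative simulation:

```python
from collections import OrderedDict

def lru_misses(requests, K):
    """Count cache misses of LRU with K cache slots."""
    cache = OrderedDict()            # ordered from least- to most-recently used
    misses = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)  # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= K:
                cache.popitem(last=False)   # evict the least recently used page
            cache[page] = True
    return misses

K = 4
# cycling through K + 1 pages: the page requested next was always just evicted
adversarial = [t % (K + 1) for t in range(10 * (K + 1))]
```

On this sequence LRU misses all 50 requests, whereas, by the observation above, Furthest-in-the-future misses only about once per K requests after its K + 1 cold misses.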
Bijective analysis [ADLO07] was first introduced as an alternative way to analyze the per-
formance of online algorithms for the paging problem. Bijective analysis directly compares
between online algorithms, allowing it to better differentiate between different algorithms that
have the same competitive ratio. Online algorithm A is at least as good as online algorithm B,
under bijective analysis, for inputs of size n, if there is a bijection π on the inputs of size n such
that A(π(σ)) ≤ B(σ) for every input σ. Note that bijective analysis considers the performance of the algorithms
for every possible input and not just the worst case.
In [AS09] bijective analysis is used to prove the optimality of LRU when locality of reference
is assumed. This provides a theoretical justification for the experimental results that show that
LRU performs much better in practice than some other marking algorithms.
1.2.4 Multi Core Caching
In the multicore caching problem, c cores share a cache of size K and each core has a separate
sequence of page requests of length n. When a core requests a page that is currently not in
the cache, this core is delayed by τ > 1 time units while this page is fetched into the cache
from main memory and some page currently in the cache (that may have been requested by
another core) is evicted to make room for the fetched page. While a core is fetching a page from
main memory, the other cores continue to advance through their request sequences. The goal
is to design an algorithm that decides, for each cache miss, which page currently in the cache
is evicted, such that the maximal number of cache misses by any core is minimized.
Much of the difficulty in designing competitive online algorithms for multi-core caching
stems from the fact that the way in which the request sequences of the different cores interleave
depends on the decisions of the algorithm. An algorithm with competitive ratio logarithmic in
the number of cores is obtained in [BGV00], for a different model in which the interleaving of
the request sequences is fixed.
Hassidim [Has10] considers the more realistic scenario in which the interleaving of the request
sequences does depend on the algorithm. He proves that even if we restrict the sequences of
different cores to be disjoint, then the offline problem is NP-hard. He further shows that if we
compare LRU with K cache to an optimal offline solution with K/α cache (a technique called
resource augmentation [ST85]), then LRU is Ω(τ/α) competitive. Note that an algorithm that
does not use the cache at all is Θ(τ) competitive.
The work in [Has10] also shows that whenever the optimal offline algorithm evicts a page of core
i it is the page that core i requests furthest in the future. This means that given the amount of
cache allocated to each core at each time, it is easy to decide (offline) which page to evict and
therefore the main challenge is to (dynamically) partition the cache between the cores.
Lopez-Ortiz and Salinger [LOS12] continued the work in the model presented in [Has10].
They show that online algorithms that use static cache partitions have a competitive ratio
of Ω(n) when compared to an optimal offline algorithm that uses dynamic cache partitions.
They also show that combining dynamic cache partitions with the traditional single-core online
eviction policies (like LRU) results in an arbitrarily large competitive ratio, as a function of
n. This paper criticizes [Has10] for considering algorithms that intentionally evict cache pages
and cause cache-misses that may lead to a more favourable interleaving of the cores’ request
sequences. The paper also defines an algorithm to be honest if it only evicts a page when
it incurs a cache miss. They show that for any offline algorithm, there is an honest offline
algorithm that is at least as good as the original algorithm. Finally, they show that even
if a cache miss takes the same amount of time to serve as a cache hit, and thus does not affect
the sequences' interleaving, the offline problem of deciding whether a certain input can be
served such that no core has more cache misses than a given per-core threshold is
NP-complete.
Chapter 2
The ordered unrelated machines
problem
The ordered unrelated machines scheduling problem is defined as follows. There are c machines
and a set J of jobs. The input is a matrix T(i, j) giving the running time of job j on machine i,
such that for every two machines i1 < i2 and every job j, T(i1, j) ≥ T(i2, j). The goal is to assign
the jobs to the machines such that the makespan is minimized.
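For concreteness, the ordering condition is simply a column-monotonicity property of the matrix T; the illustrative helper below checks it:

```python
def is_ordered_instance(T):
    """T[i][j]: running time of job j on machine i (machines indexed by strength).

    The instance is ordered if every column is non-increasing in i, i.e.
    machine i + 1 runs every job at least as fast as machine i.
    """
    return all(T[i][j] >= T[i + 1][j]
               for i in range(len(T) - 1)
               for j in range(len(T[0])))

# Machine 1 is at least as fast as machine 0 on every job: ordered.
T_ok = [[5, 3, 9],
        [4, 3, 7]]
# Job 1 runs slower on the "stronger" machine: not ordered.
T_bad = [[5, 3, 9],
         [4, 6, 7]]
```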
The ordered unrelated machines scheduling problem is a special case of scheduling on un-
related machines in which there is a total order on the machines that captures their relative
strengths. This special case is natural since in many practical scenarios the machines have some
underlying notion of strength and jobs run faster on a stronger machine. For example a newer
computer typically dominates an older one in all parameters, or a more experienced employee
does any job faster than a new recruit.
Lenstra et al. [LST90] gave a 2-approximation algorithm for scheduling on unrelated machines
based on rounding an optimal fractional solution to a linear program, and proved that it is NP-
hard to approximate the problem to within a factor better than 3/2. It is currently an open
question if there are better approximation algorithms for ordered unrelated machines than the
more general algorithms that approximate unrelated machines.
Another well-studied scheduling problem is scheduling on uniformly related machines. In
this problem, the time it takes for machine i to run job j is lj/si, where lj is the load of job j and
si is the speed of machine i. A polynomial time approximation scheme for related machines
is described in [HS88]. It is easy to see that the problem of scheduling on related machines is
a special case of the problem of scheduling on ordered unrelated machines, and therefore the
ordered unrelated machines problem is NP-hard.
The ordered unrelated machines problem is closely related to the joint cache partition and
job assignment problem. Consider an instance of the joint cache partition and job assignment
problem with c cores, K cache and a set of jobs J such that Tj(x) is the load function of job
j. If we fix the cache partition to be some arbitrary partition p, and we index the cores in
non-decreasing order of their cache allocation, then we get an instance of the ordered unrelated
machines problem, where T (i, j) = Tj(p(i)). Our constant approximation algorithm for the joint
cache partition and job assignment problem, described in Section 3.1, uses this observation as
well as Lenstra's 2-approximation algorithm for unrelated machines. In the rest of this chapter
we prove that the joint cache partition and job assignment problem is at least as hard as the
ordered unrelated machines scheduling problem.
We reduce the ordered unrelated machine problem to the joint cache partition and job
assignment problem. Consider the decision version of the ordered unrelated scheduling problem,
with c machines and n = |J | jobs, where job j takes time T (i, j) to run on machine i. We want
to decide if it is possible to schedule the jobs on the machines with makespan at most M .
Define the following instance of the joint cache partition and job assignment problem. This
instance has c cores, a total cache K = c(c + 1)/2 and n′ = n + c jobs. The first n jobs
(1 ≤ j ≤ n) correspond to the jobs in the original ordered unrelated machines problem, and c
jobs are new jobs (n + 1 ≤ j ≤ n + c). The load function Tj(x) of job j, where 1 ≤ j ≤ n,
equals T (x, j) if x ≤ c and equals T (c, j) if x > c. The load function Tj(x) of job j, where
n + 1 ≤ j ≤ n + c, equals M + δ, for some fixed δ > 0, if x ≥ j − n, and equals ∞ if x < j − n. Our
load functions Tj(x) are non-increasing because the original T (i, j)’s are non-increasing in the
machine index i.
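The construction above can be sketched as follows; `build_joint_instance` is a hypothetical helper name (not from the thesis), and load functions are represented as Python callables over the cache size x ≥ 1.

```python
# Sketch of the reduction: given the c x n time matrix T of an ordered
# unrelated machines instance and a deadline M, build the total cache K
# and the n + c load functions of the joint instance.
def build_joint_instance(T, M, delta):
    c, n = len(T), len(T[0])
    K = c * (c + 1) // 2                  # total cache of the joint instance
    INF = float("inf")

    def original_job(j):
        # T_j(x) = T(x, j) for x <= c and T(c, j) for x > c
        return lambda x: T[min(x, c) - 1][j]

    def new_job(i):
        # job n + i demands at least i cache; with it, it runs in M + delta
        return lambda x: (M + delta) if x >= i else INF

    return K, ([original_job(j) for j in range(n)] +
               [new_job(i) for i in range(1, c + 1)])
```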
Lemma 2.1. The makespan of the joint cache partition and job assignment instance defined
above is at most 2M+δ if and only if the makespan of the original unrelated scheduling problem
is at most M .
Proof. Assume there is an assignment S′ of the jobs in the original ordered unrelated machines
instance of makespan at most M . We show a cache partition p and job assignment S for the
joint cache partition and job assignment instance with makespan at most 2M + δ.
The cache partition p is defined such that p(i) = i for each core i. The partition p uses
exactly K = c(c + 1)/2 cache. The job assignment S is defined such that for a job j > n,
S(j) = j − n and for a job j ≤ n, S(j) = S′(j). The partition p assigns i cache to core i, which
is exactly enough for job n+ i, which is assigned to core i by S, to run in time M + δ. It is easy
to verify that p, S is a solution to the joint cache partition and job assignment instance with
makespan at most 2M + δ.
Assume there is a solution p, S for the joint cache partition and job assignment instance,
with makespan at most 2M + δ. Job j, such that n < j ≤ n+ c, must run on a core with cache
at least j−n, or else the makespan would be infinite. Moreover, no two jobs j1 > n and j2 > n
are assigned by S to the same core, as this would give a makespan of at least 2M + 2δ. Combining
these observations with the fact that the total available cache is K = c(c + 1)/2, we get that
the cache partition must be p(i) = i for each core i. Furthermore, each job j > n is assigned
by S to core j − n and all the other jobs assigned by S to core j − n are jobs corresponding to
original jobs in the ordered unrelated machines instance. Therefore, the total load of original
jobs assigned by S to core i is at most M .
We define S′, a job assignment for the original ordered unrelated machines instance, by
setting S′(j) = S(j) for each j ≤ n. Since S assigns original jobs of total load at most M on
each core, it follows that the makespan of S′ is at most M.
The following theorem follows immediately from Lemma 2.1.
Theorem 2.2. There is a polynomial-time reduction from the ordered unrelated machines
scheduling problem to the joint cache partition and job assignment problem.
The reduction in the proof of Lemma 2.1 does not preserve approximation guarantees.
However, by choosing δ carefully, we can get the following result.
Theorem 2.3. Given an algorithm A for the joint cache partition and job assignment problem
that approximates the optimal makespan up to a factor of 1 + ε, for 0 < ε < 1, we can construct
an algorithm for the ordered unrelated machines scheduling problem that approximates the
optimal makespan up to a factor of 1 + 2ε + 2ε²/(1 − ε − χ) for any χ > 0.
Proof. We first obtain a (1 + 2ε + 2ε²/(1 − ε − χ))-approximation algorithm for the decision
version of the ordered unrelated machines scheduling problem. That is, an algorithm that,
given a value M, either decides that there is no assignment of makespan M or finds an
assignment with makespan (1 + 2ε + 2ε²/(1 − ε − χ))M.
Given an instance of the ordered unrelated machines scheduling problem, we construct an
instance of the joint cache partition and job assignment problem as described before Lemma 2.1,
and set δ = 2εM/(1 − ε − χ), for an arbitrarily small χ > 0. We use algorithm A to solve the
resulting instance of the joint cache partition and job assignment problem. Let p, S be the
solution returned by A. We define S′(j) = S(j) for each 1 ≤ j ≤ n. If the makespan of S′ is at
most (1 + 2ε + 2ε²/(1 − ε − χ))M we return S′ as the solution, and otherwise we decide that
there is no solution with makespan at most M.
If the makespan of the original instance is at most M, then by Lemma 2.1 there is a solution
to the joint cache partition and job assignment instance resulting from the reduction, with
makespan at most 2M+δ. Therefore p, S, the solution returned by algorithm A, is of makespan
at most (1 + ε)(2M + δ).
By our choice of δ we have that (1+ε)(2M+δ) < 2M+2δ and therefore each core is assigned
by S at most one job j, such that j > n. In addition, any job j such that n < j ≤ n+ c, must
run on a core with cache at least j − n, or else the makespan would be infinite. Combining
these observations with the fact that the total available cache is K = c(c + 1)/2, we get that
the cache partition must be p(i) = i for each core i. Furthermore, each job j > n is assigned
by S to core j − n and all the other jobs assigned by S to core j − n are jobs corresponding to
original jobs in the ordered unrelated machines instance. Therefore, the total load of original
jobs assigned by S to core i is at most (1 + ε)(2M + δ) − M − δ. It follows that the makespan
of S′ is at most (1 + ε)(2M + δ) − M − δ = M(1 + 2ε + 2ε²/(1 − ε − χ)).
We obtained a (1 + 2ε + 2ε²/(1 − ε − χ))-approximation algorithm for the decision version of
the ordered unrelated machines scheduling problem. In order to approximately solve the
optimization problem, we can perform a binary search for the optimal makespan using the
approximation algorithm for the decision version of the problem and get a
(1 + 2ε + 2ε²/(1 − ε − χ))-approximation algorithm for the optimization problem. We obtain an
initial search range for the binary search by using ∑_{j=1}^{n} T(c, j) as an upper bound on
the makespan of the optimal schedule and (1/c) ∑_{j=1}^{n} T(c, j) as a lower bound. (See
Section 3.2.4 for a detailed discussion of a similar case of using an approximate decision
algorithm in a binary search framework to obtain an approximate optimization algorithm.)
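The binary-search framework can be sketched generically; `decide` stands for any approximate decision procedure (it returns a schedule when one of makespan at most its argument exists, and None otherwise), and the names and tolerance handling are illustrative assumptions, not from the thesis.

```python
# Generic binary search over the makespan using an approximate decision
# procedure.  `lower` must be a positive lower bound on the optimal makespan
# and `upper` an upper bound; returns the smallest deadline found to be
# feasible (up to the tolerance) together with the schedule for it.
def approx_optimize(decide, lower, upper, tol=1e-3):
    best = None
    while upper - lower > tol * lower:
        mid = (lower + upper) / 2.0
        sol = decide(mid)
        if sol is not None:
            best, upper = sol, mid       # feasible at mid: search below
        else:
            lower = mid                  # infeasible at mid: search above
    return upper, best
```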
Chapter 3

Joint cache partition and job assignment
3.1 A constant approximation algorithm
We first obtain an 18-approximation algorithm for the joint cache partition and job assignment
problem that uses (1 + (5/2)ε)K cache for some constant 0 < ε < 1/2. We then show another
algorithm that uses K cache and approximates the makespan up to a factor of 36.
Our first algorithm, denoted by A, enumerates over a subset of cache partitions, denoted
by P (K, c, ε). For each partition in this set A approximates the makespan of the corresponding
scheduling problem, using Lenstra’s algorithm, and returns the partition and associated job
assignment with the smallest makespan.
Let K′ = (1 + ε)^⌈log1+ε(K)⌉, the smallest integral power of (1 + ε) which is at least K. The
set P(K, c, ε) contains cache partitions in which the cache allocated to each core is an integral
power of (1 + ε) and the number of different integral powers used by the partition is at most
log2(c). We denote by b the number of different cache sizes in a partition. Each core is allocated
K′/(1 + ε)^lj cache, where lj ∈ N and 1 ≤ j ≤ b. The smallest possible cache allocated to any
core is the smallest integral power of (1 + ε) which is at least Kε/c, and the largest possible
cache allocated to a core is K′. We denote by σj the number of cores with cache at least
K′/(1 + ε)^lj. It follows that there are (σj − σj−1) cores with K′/(1 + ε)^lj cache. We require
that σj is an integral power of 2 and that the total cache used is at most (1 + (5/2)ε)K.
Formally,
P(K, c, ε) = { (l = <l1, . . . , lb>, σ = <σ0, σ1, . . . , σb>) | b ∈ N, 1 ≤ b ≤ log2 c, (3.1)
∀j, lj ∈ N, 0 ≤ lj ≤ log1+ε(c/ε) + 1, ∀j, lj+1 > lj, (3.2)
∀j ∃uj ∈ N s.t. σj = 2^uj, σ0 = 0, σb ≤ c, ∀j, σj+1 > σj, (3.3)
∑_{j=1}^{b} (σj − σj−1) · K′/(1 + ε)^lj ≤ (1 + (5/2)ε)K } (3.4)
When the parameters are clear from the context, we use P to denote P (K, c, ε). Let M(p, S)
denote the makespan of cache partition p and job assignment S. The following theorem specifies
the main property of P , and is proven in the remainder of this section.
Theorem 3.1. Let p, S be any cache partition and job assignment. There exist a cache partition
p̄ and a job assignment S̄ such that p̄ ∈ P and M(p̄, S̄) ≤ 9M(p, S).
An immediate corollary of Theorem 3.1 is that algorithm A described above finds a cache
partition and job assignment with makespan at most 18 times the optimal makespan.
Lemma 3.2 shows that A is a polynomial time algorithm.
Lemma 3.2. The size of P is polynomial in c.
Proof. Let (l, σ) ∈ P . The vector σ is a strictly increasing vector of integral powers of 2, where
each power is at most c. Therefore the number of possible vectors for σ is bounded by the
number of subsets of 20, . . . , 2log2(c) which is O(2log2 c) = O(c). The vector l is a strictly
increasing vector of integers, each integer is at most log1+ε(cε ) + 1. Therefore the number of
vectors l is bounded by the number of subsets of integers that are at most log1+ε(cε )+1 which is
O(2log1+ε(cε)) = O(2
log2(cε )
log2(1+ε) ) = Poly(c) since ε is a constant. Therefore |P | = O(c 2log2(
cε )
log2(1+ε) ).
Let (p, S) be a cache partition and a job assignment that use c cores and K cache and have
makespan M(p, S). Define a cache partition p1 such that for each core i, if p(i) < Kε/c then
p1(i) = Kε/c, and if p(i) ≥ Kε/c then p1(i) = p(i). For each core i, p1(i) ≤ p(i) + Kε/c and
hence the total amount of cache allocated by p1 is bounded by (1 + ε)K. For each core i,
p1(i) ≥ p(i) and therefore M(p1, S) ≤ M(p, S).
Let p2 be a cache partition such that for each core i, p2(i) = (1 + ε)^⌈log1+ε(p1(i))⌉, the
smallest integral power of (1 + ε) that is at least p1(i). For each i, p2(i) ≥ p1(i) and thus
M(p2, S) ≤ M(p1, S) ≤ M(p, S). We increased the total cache allocated by at most a
multiplicative factor of (1 + ε) and therefore the total cache used by p2 is at most
(1 + ε)²K ≤ (1 + (5/2)ε)K since ε < 1/2.
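The two rounding steps can be sketched as follows; `round_partition` is a hypothetical name, and exact floating-point behavior of the logarithm near powers of (1 + ε) is glossed over.

```python
import math

# Sketch of the rounding: raise every allocation to at least K*eps/c,
# then round up to the next integral power of (1 + eps).  The result uses
# at most (1 + eps)^2 * K total cache and never decreases any allocation.
def round_partition(p, K, eps):
    c = len(p)
    floor_cache = K * eps / c
    p1 = [max(pi, floor_cache) for pi in p]          # uses at most (1+eps)K
    return [(1 + eps) ** math.ceil(math.log(x, 1 + eps)) for x in p1]
```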
Let ϕ be any cache partition that allocates to each core an integral power of (1 + ε) cache.
We define the notion of cache levels. We say that core i is of cache level l in ϕ if
ϕ(i) = K′/(1 + ε)^l. Let cl(ϕ) denote the number of cores in cache level l in ϕ. The vector of
cl's, which we call the cache levels vector of ϕ, defines the partition ϕ completely, since any
two partitions that have the same cache levels vector are identical up to a renaming of the cores.
Let σ(ϕ) be the vector of prefix sums of the cache levels vector of ϕ. Formally,
σl(ϕ) = ∑_{i=0}^{l} ci(ϕ). Note that σl(ϕ) is the number of cores in cache partition ϕ with at
least K′/(1 + ε)^l cache, and that for each l, σl(ϕ) ≤ c.
For each such cache partition ϕ, we define the significant cache levels li(ϕ) recursively as
follows. The first significant cache level l1(ϕ) is the first cache level l such that cl(ϕ) > 0.
Assume we already defined the first i − 1 significant cache levels and let l′ = li−1(ϕ); then
li(ϕ) is the smallest cache level l > l′ such that σl(ϕ) ≥ 2σl′(ϕ).
Lemma 3.3. Let lj and lj+1 be two consecutive significant cache levels of ϕ. Then the total
number of cores in cache levels strictly between lj and lj+1 is at most σlj(ϕ). Let lb be the
last significant cache level of ϕ. Then the total number of cores in cache levels larger than lb
is at most σlb(ϕ).
Proof. Assume to the contrary that the total number of cores in cache levels strictly between
lj and lj+1 is greater than σlj(ϕ). This implies that for l′ = lj+1 − 1, σl′(ϕ) > 2σlj(ϕ), which
contradicts the assumption that there are no significant cache levels strictly between lj and
lj+1 in ϕ. The proof of the second part of the lemma is analogous.
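The recursive definition of significant cache levels can be sketched as a single pass over the cache levels vector; the function name is an assumption.

```python
# Compute the significant cache levels of a partition, given its cache
# levels vector c_levels (c_levels[l] = number of cores at level l).
def significant_levels(c_levels):
    prefix, total = [], 0
    for cl in c_levels:
        total += cl
        prefix.append(total)            # sigma_l = number of cores at levels <= l
    sig = []
    for l, cl in enumerate(c_levels):
        if not sig:
            if cl > 0:                  # first level with at least one core
                sig.append(l)
        elif prefix[l] >= 2 * prefix[sig[-1]]:
            sig.append(l)               # sigma doubled since the last significant level
    return sig
```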
Let cl = cl(p2). For each core i, Kε/c ≤ p2(i) ≤ K′, so we get that if l is a cache level in p2
such that cl ≠ 0 then 0 ≤ l ≤ log1+ε(c/ε) + 1. Let σl = σl(p2) and σ = <σ1, . . . , σb′>, where
b′ = log1+ε(c/ε) + 1. Let li = li(p2), for 1 ≤ i ≤ b, where b is the number of significant cache
levels in p2.
We adjust p2 and S to create a new cache partition p3 and a new job assignment S3. Cache
partition p3 has cores only in the significant cache levels l1, . . . , lb of p2. We obtain p3 from p2
as follows. Let f be a non-significant cache level in p2. If there is a j such that lj−1 < f < lj
then we take the cf cores in cache level f in p2 and reduce their cache so they are now in cache
level lj in p3. If f > lb then we remove the cf cores at level f from our solution. It is easy to
check that the significant cache levels of p3 are the same as those of p2, that is l1, . . . , lb.
Since we only reduce the cache allocated to some cores, the new cache partition p3 uses no
more cache than p2, which is at most (1 + (5/2)ε)K.
We construct S3 by changing the assignment of the jobs assigned by S to cores in non-
significant cache levels in p2. As before, let f be a nonsignificant cache level and let lj−1 be the
maximal significant cache level such that lj−1 < f . For each core i in cache level f in p2 we
move all the jobs assigned by S to core i, to a target core in cache level lj−1 in p3. Lemma 3.4
specifies the key property of this job-reassignment.
Lemma 3.4. We can construct S3 such that each core in a significant level of p3 is the target
of the jobs from at most two cores in a nonsignificant level of p2.
Proof. Let c3 denote the cache levels vector of p3 and let σ3 denote the vector of prefix sums
of c3. From the definition of p3 it follows that for all j, σ3lj = σlj, and that for j > 1,
c3lj = σ3lj − σ3lj−1 = σlj − σlj−1.
By Lemma 3.3, the number of cores in non-significant levels in p2 whose jobs are reassigned
to one of the c3lj cores in level lj in p3 is at most σlj. So for j > 1 the ratio between the
number of cores whose jobs are reassigned and the number of target cores in level lj in p3 is
at most σlj/(σlj − σlj−1) = 1 + σlj−1/(σlj − σlj−1) ≤ 2. For j = 1 the number of target cores
in level l1 of p3 is c3l1 = σl1, which is at least as large as the number of cores at
non-significant levels between l1 and l2 in p2, so we can reassign the jobs of a single core of
a non-significant level between l1 and l2 in p2 to each target core.
Corollary 3.5. M(p3, S3) ≤ 3M(p, S)
Proof. In the new cache partition p3 and job assignment S3 we have added to each core at a
significant level in p3 the jobs from at most 2 other cores at nonsignificant levels in p2. The
target core always has more cache than the original core, thus the added load from each original
core is at most M(p2, S). It follows that M(p3, S3) ≤ 3M(p2, S) ≤ 3M(p, S).
Let c3 denote the cache levels vector of p3 and let σ3 denote the vector of prefix sums of c3.
We now define another cache partition p̄ based on p3. Let uj = ⌊log2(σ3lj)⌋. The partition p̄
has 2^u1 cores in cache level l1, and 2^uj − 2^uj−1 cores in cache level lj for 1 < j ≤ b. The
cache levels l1, . . . , lb are the significant cache levels of p̄, and p̄ has cores only in its
significant cache levels. Let c̄lj denote the number of cores in the significant cache level lj
in p̄.
Lemma 3.6. 3c̄lj ≥ c3lj.
Proof. By the definition of uj, we have that 2^uj ≤ σ3lj < 2^(uj+1). So for j > 1,
c̄lj/c3lj = (2^uj − 2^uj−1) / (σ3lj − σ3lj−1) > (2^uj − 2^uj−1) / (2^(uj+1) − 2^uj−1) = (2^(uj−uj−1) − 1) / (2^(uj−uj−1+1) − 1). (3.5)
Since lj and lj−1 are two consecutive significant cache levels we have that uj − uj−1 ≥ 1. The
ratio in (3.5) is an increasing function of uj − uj−1 and thus is minimized at uj − uj−1 = 1,
yielding a lower bound of 1/3. For j = 1, c̄l1/c3l1 = 2^u1/σ3l1 > 2^u1/2^(u1+1) = 1/2.
Lemma 3.6 shows that the cache partition p̄ has in each cache level lj at least a third of the
cores that p3 has at level lj. Therefore, there exists a job assignment S̄ that assigns to each
core of cache level lj in p̄ the jobs that S3 assigns to at most 3 cores in cache level lj in p3.
We only moved jobs within the same cache level, so their loads remain the same, and the
makespan satisfies M(p̄, S̄) ≤ 3M(p3, S3) ≤ 9M(p, S).
Lemma 3.7. Cache partition p̄ is in the set P(K, c, ε).
Proof. Let σ̄ be the vector of prefix sums of c̄. The vectors <l1, . . . , lb>, <σ̄l1, . . . , σ̄lb>
clearly satisfy properties (3.1)–(3.3) in the definition of P(K, c, ε). It remains to show that
p̄ uses at most (1 + (5/2)ε)K cache (property (3.4)).
Consider the core with the x-th largest cache in p̄. Let lj be the cache level of this core.
Thus σ̄lj ≥ x. Since σ̄lj is the result of rounding down σ3lj to the nearest integral power of 2,
we have that σ̄lj ≤ σ3lj. It follows that σ3lj ≥ x and therefore the core with the x-th largest
cache in p3 is in cache level lj or smaller, and thus it has at least as much cache as the x-th
largest core in p̄. So p̄ uses at most the same amount of cache as p3, which is at most
(1 + (5/2)ε)K.
This concludes the proof of Theorem 3.1, and establishes that our algorithm A is an
18-approximation algorithm for the problem, using (1 + (5/2)ε)K cache.
We provide a variation of algorithm A that uses at most K cache and finds a 36-approximation
for the optimal makespan. Algorithm B enumerates over r, 1 ≤ r ≤ K, the amount of cache
allocated to the first core. It then enumerates over the set of partitions
P = P((K − r)/2, ⌈c/2⌉ − 1, 2/5). For each partition in P it adds another core with r cache
and applies Lenstra's approximation algorithm to the resulting instance of the unrelated
machines scheduling problem, to assign all the jobs in J to the ⌈c/2⌉ cores. Algorithm B
returns the partition and assignment with the minimal makespan it encounters.
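Algorithm B's outer structure can be sketched as follows; `enumerate_partitions` (standing for the set P((K − r)/2, ⌈c/2⌉ − 1, 2/5)) and `lenstra` (standing for Lenstra's 2-approximation) are treated as black boxes, and their interfaces are assumptions made for this sketch.

```python
# Skeleton of algorithm B: enumerate the cache r of the first core, then
# the partitions of the remaining cores, and keep the best solution found.
def algorithm_B(jobs, K, c, enumerate_partitions, lenstra):
    best = None
    for r in range(1, K + 1):                       # cache of the first core
        for p in enumerate_partitions((K - r) // 2, (c + 1) // 2 - 1, 2 / 5):
            partition = [r] + list(p)               # add "core 1" with r cache
            assignment, makespan = lenstra(jobs, partition)
            if best is None or makespan < best[2]:
                best = (partition, assignment, makespan)
    return best
```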
Theorem 3.8. If there is a solution of makespan M that uses at most K cache and at most c
cores, then algorithm B returns a solution of makespan at most 36M that uses at most K cache
and at most c cores.
Proof. Let (p, S) be a solution of makespan M that uses K cache and c cores. W.l.o.g. assume
that the cores are indexed in non-increasing order of their cache allocation in this solution,
that is, p(i) ≥ p(i + 1).
Let J′ = {j ∈ J | S(j) ≥ 3}. Consider the following job assignment S′ of the jobs in J′ to
the cores of odd indices greater than 1 in (p, S). The assignment S′ assigns to core 2i − 1,
for i ≥ 2, all the jobs that are assigned by S to cores 2i − 1 and 2i. Note that all the jobs
assigned by S′ to some core are assigned by S to a core with at most the same amount of cache,
and thus the makespan of S′ is at most 2M.
Assume r = p(1). Then K = r + ∑_{odd i≥3} (p(i) + p(i − 1)) ≥ r + ∑_{odd i≥3} 2p(i) since p is
non-increasing. Therefore we get that ∑_{odd i≥3} p(i) ≤ (K − r)/2. Therefore we can assign
the jobs in J′ to ⌈c/2⌉ − 1 cores with a total cache of (K − r)/2, such that the makespan is
at most 2M. By Theorem 3.1, there is a partition p′ ∈ P((K − r)/2, ⌈c/2⌉ − 1, 2/5) that
allocates at most (1 + (5/2)·(2/5))·(K − r)/2 = K − r cache to ⌈c/2⌉ − 1 cores, and a job
assignment S′ of the jobs in J′ to these cores, such that the makespan of p′, S′ is at most 18M.
Let p̄ be a cache partition that adds to p′ another core (called "core 1") with r cache. The
total cache used by p̄ is at most K. Let S̄ be a job assignment such that S̄(j) = S′(j) for
j ∈ J′ and, for a job j ∈ J \ J′ (a job that was assigned by S either to core 1 or to core 2),
S̄(j) = 1. Since the makespan of (p, S) is M, we know that the load on core 1 in the solution
p̄, S̄ is at most 2M. It follows that the makespan of p̄, S̄ is at most 18M.
When algorithm B fixes the size of the cache of the first core to be r = p(1) and considers
p′ ∈ P((K − r)/2, ⌈c/2⌉ − 1, 2/5), it obtains the cache partition p̄. We know that S̄ is a
solution to the corresponding scheduling problem with makespan at most 18M. Therefore
Lenstra's approximation algorithm finds an assignment with makespan at most 36M.
3.2 Single load and minimal cache demand
We consider a special case of the general joint cache partition and job assignment problem where
each job has a minimal cache demand xj and single load value aj . Job j must run on a core
with at least xj cache and it contributes a load of aj to the core. We want to decide whether
the jobs can be assigned to c cores, using K cache, such that the makespan is at most M.
W.l.o.g. we assume M = 1.
In Section 3.2.1 we describe a 2-approximate decision algorithm: if the given instance has a
solution of makespan at most 1, it returns a solution with makespan at most 2, and otherwise
it may fail. In Sections 3.2.2 and 3.2.3 we improve the approximation guarantee to 3/2 and
4/3 at the expense of using 2K and 3K cache, respectively. In Section 3.2.4 we show how to
obtain an approximate optimization algorithm from an approximate decision algorithm and a
standard binary search technique.
3.2.1 2-approximation
We present a 2-approximate decision algorithm, denoted by A2. Algorithm A2 sorts the jobs in
a non-increasing order of their cache demand. It then assigns the jobs to the cores in this order.
It keeps assigning jobs to a core until the load on the core exceeds 1. Then, A2 starts assigning
jobs to the next core. Note that among the jobs assigned to a specific core the first one is the
most cache demanding and it determines the cache allocated to this core by A2. Algorithm A2
fails if the generated solution uses more than c cores or more than K cache. Otherwise, A2
returns the generated cache partition and job assignment.
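A minimal sketch of A2, assuming jobs are given as (cache demand, load) pairs; the representation and names are illustrative.

```python
# Sketch of algorithm A2: greedy assignment in non-increasing order of
# cache demand.  Returns (cache allocation, load) per core, or None on failure.
def algorithm_A2(jobs, c, K):
    jobs = sorted(jobs, key=lambda j: j[0], reverse=True)
    cores, cache = [], []                 # per-core load and cache allocation
    for x, a in jobs:
        if not cores or cores[-1] > 1:    # current core's load already exceeds 1
            cores.append(0.0)
            cache.append(x)               # first job fixes the core's cache
        cores[-1] += a
    if len(cores) > c or sum(cache) > K:
        return None                       # A2 fails
    return cache, cores
```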
Theorem 3.9. If there is a cache partition and job assignment of makespan at most 1 that use
c cores and K cache then algorithm A2 finds a cache partition and job assignment of makespan
at most 2 that use at most c cores and at most K cache.
Proof. Let Y = (p, S) be the cache partition and job assignment with makespan at most 1
whose existence is assumed by the theorem. Y has makespan at most 1, so the sum of the loads
of all jobs is at most c. Since A2 loads each core, except maybe the last one, with more than 1
load it follows that A2 uses at most c cores.
Since Y has makespan at most 1 the load of each of the jobs is at most 1. Algorithm A2
only exceeds a load of 1 on a core by the load of the last job assigned to this core and thus A2
yields a solution with makespan at most 2.
Assume w.l.o.g that the cores in Y are indexed such that for any core i, p(i + 1) ≤ p(i).
Assume that the cores in A2 are indexed in the order in which they were loaded by A2. By
the definition of A2 the cores are also sorted by non-increasing order of their cache allocation.
Denote by z(i) the amount of cache A2 allocates to core i. We show that for all i ∈ {1, . . . , c},
z(i) ≤ p(i). This implies that algorithm A2 uses at most K cache.
A2 allocates to the first core the cache required by the most demanding job, so
z(1) = max_j xj. This job must be assigned in Y to some core and therefore z(1) ≤ p(1).
Assume to the contrary that z(i) > p(i) for some i. Each job j with cache demand xj > p(i)
must be assigned in Y to one of the first (i − 1) cores, because all the other cores do not have
enough cache to run this job. Since Y has makespan at most 1, we know that
∑_{j | xj > p(i)} aj ≤ i − 1. Consider all the jobs with cache demand at least z(i). Algorithm
A2 failed to assign all these jobs to the first (i − 1) cores, and we know that A2 assigns more
than 1 load to each core, so ∑_{j | xj ≥ z(i)} aj > i − 1. Since z(i) > p(i), every job with
xj ≥ z(i) also satisfies xj > p(i), so ∑_{j | xj ≥ z(i)} aj ≤ ∑_{j | xj > p(i)} aj, which leads
to a contradiction. Therefore z(i) ≤ p(i) for all i, and algorithm A2 uses at most K cache.
3.2.2 3/2-approximation with 2K cache
We define a job to be large if aj > 1/2 and small otherwise. Our algorithm A3/2 assigns one
large job to each core. Let si be the load on core i after the large jobs are assigned. Let
ri = 1 − si. We process the small jobs in non-increasing order of their cache demand xj, and
assign them to the cores in non-increasing order of the cores' ri's. We stop assigning jobs to
a core when its load exceeds 1 and start loading the next core. Algorithm A3/2 allocates to
each core the cache demand of its most demanding job. Algorithm A3/2 fails if the resulting
solution uses more than c cores or more than 2K cache.
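A sketch of A3/2 under the same (cache demand, load) job representation used above; the function name and the guard for running out of cores are illustrative assumptions.

```python
# Sketch of algorithm A3/2: one large job (load > 1/2) per core, then greedy
# assignment of the small jobs in non-increasing cache demand, visiting
# cores in non-increasing order of the remaining room r_i = 1 - s_i.
def algorithm_A32(jobs, c, K):
    large = sorted((j for j in jobs if j[1] > 0.5), key=lambda j: j[0], reverse=True)
    small = sorted((j for j in jobs if j[1] <= 0.5), key=lambda j: j[0], reverse=True)
    if len(large) > c:
        return None
    load = [a for _, a in large] + [0.0] * (c - len(large))
    cache = [x for x, _ in large] + [0.0] * (c - len(large))
    order = sorted(range(c), key=lambda i: 1 - load[i], reverse=True)
    k = 0                                  # index into `order`
    for x, a in small:
        while k < c and load[order[k]] > 1:
            k += 1                         # current core is full, move on
        if k == c:
            return None                    # would need more than c cores
        i = order[k]
        cache[i] = max(cache[i], x)        # most demanding job fixes the cache
        load[i] += a
    if sum(cache) > 2 * K:
        return None
    return cache, load
```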
Theorem 3.10. If there is a cache partition and job assignment of makespan at most 1 that
use c cores and K cache, then A3/2 finds a cache partition and job assignment that use at
most 2K cache and at most c cores and have a makespan of at most 3/2.
Proof. Let Y = (p, S) be the cache partition and job assignment with makespan at most 1
whose existence is assumed by the theorem. The existence of Y implies that there are at most
c large jobs in our input and that the total volume of all the jobs is at most c. Therefore
algorithm A3/2 uses at most c cores to assign the large jobs. Furthermore, when A3/2 assigns
the small jobs it loads each core, except maybe the last one, with a load of at least 1 and
thus uses at most c cores. Algorithm A3/2 provides a solution with makespan at most 3/2
since it can only exceed a load of 1 on any core by the load of a single small job.
Let z be the cache partition generated by A3/2. Let Cl be the set of cores whose most cache
demanding job is a large job and Cs be the set of cores whose most cache demanding job is a
small job. For a core i ∈ Cl, let ji be the most cache demanding job assigned to core i, so we
have z(i) = xji. The solution Y = (p, S) is a valid solution, thus xji ≤ p(S(ji)), so
z(i) ≤ p(S(ji)). If j1, j2 are two large jobs then S(j1) ≠ S(j2), and we get that
∑_{i∈Cl} z(i) ≤ ∑_{i∈Cl} p(S(ji)) ≤ ∑_{i=1}^{c} p(i) = K.
In the rest of the proof we index the cores in the solution of A3/2 such that
r1 ≥ r2 ≥ . . . ≥ rc. This is the same order in which A3/2 assigns small jobs to the cores. In Y
we assume that the cores are indexed such that p(i) ≥ p(i + 1). We now prove that z(i) ≤ p(i)
for any core i ∈ Cs. Assume, to the contrary, that for some i, z(i) > p(i). Let α be the cache
demand of the most cache demanding small job on core i in Y. Let
J1 = {j | aj ≤ 1/2, xj ≥ z(i)} and let J2 = {j | aj ≤ 1/2, xj > α}. Since α ≤ p(i) and by our
assumption p(i) < z(i), we get that α < z(i) and therefore J1 ⊆ J2.
A3/2 does not assign all the jobs of J1 to its first (i − 1) cores and therefore the total load
of the jobs in J1 is greater than ∑_{l=1}^{i−1} rl. On the other hand, we know that in Y the
assignment S assigns all the jobs in J2 to its first i − 1 cores while not exceeding a load of 1.
Thus the total load of the jobs in J2 is at most the space available for small jobs on the first
(i − 1) cores in solution Y. Since r1 ≥ r2 ≥ . . . ≥ rc, and since in any solution each core runs
at most one large job, we get that ∑_{l=1}^{i−1} rl is at least as large as the space available
for small jobs in any subset of (i − 1) cores in any solution. It follows that the total load of
jobs in J2 is smaller than the total load of jobs in J1. This contradicts the fact that J1 ⊆ J2.
We conclude that for every i ∈ Cs, z(i) ≤ p(i). This implies that the total cache allocated to
cores in Cs is at most K. We previously showed that the total cache allocated to cores in Cl
is at most K, and thus the total cache used by algorithm A3/2 is at most 2K.
3.2.3 4/3-approximation with 3K cache, using dominant matching
We present a 4/3-approximate decision algorithm, A4/3, that uses at most 3K cache. The main
challenge is assigning the large jobs, which here are defined as jobs of load greater than 1/3.
There are at most 2c large jobs in our instance, because we assume there is a solution of
makespan at most 1 that uses c cores. Algorithm A4/3 matches these large jobs into pairs, and
assigns each pair to a different core. In order to perform the matching, we construct a graph G
where each vertex represents a large job j of weight aj > 1/3. If needed, we add artificial
vertices of weight zero to have a total of exactly 2c vertices in the graph. Two vertices have
an edge between them if the sum of their weights is at most 1. The weight of an edge is the
sum of the weights of its endpoints.
A perfect matching in a graph is a subset of edges such that every vertex in the graph is
incident to exactly one edge in the subset. We note that there is a natural bijection between
perfect matchings in the graph G and assignments of makespan at most 1 of the large jobs to
the cores. The c edges in any perfect matching define the assignment of the large jobs to the
c cores as follows: Let (a, b) be an edge in the perfect matching. If both a and b correspond
to large jobs, we assign both these jobs to the same core. If a corresponds to a large job and
b is an artificial vertex, we assign the job corresponding to a to its own core. If both a and b
are artificial vertices, we leave a core without any large jobs assigned to it. Similarly, we can
injectively map any assignment of the large jobs of makespan at most 1 to a perfect matching
in G: For each core that has 2 large jobs assigned to it, we select the edge in G corresponding
to these jobs, for each core with a single large job assigned to it, we select an edge between the
corresponding real vertex and an arbitrary artificial vertex, and for each core with no large jobs
assigned to it we select an edge in G between two artificial vertices.
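The matching-to-assignment direction of this bijection can be sketched as follows, assuming vertices 0, . . . , n_large − 1 are the real large jobs and the matching is given as a list of c vertex pairs (this representation is an assumption of the sketch).

```python
# Map a perfect matching on the 2c vertices of G to an assignment of the
# large jobs: each matched pair defines one core; artificial vertices
# (indices >= n_large) contribute no job to their core.
def matching_to_assignment(matching, n_large):
    assignment = {}                       # large-job index -> core index
    for core, (a, b) in enumerate(matching):
        for v in (a, b):
            if v < n_large:               # drop artificial vertices
                assignment[v] = core
    return assignment
```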
A dominant perfect matching in G is a perfect matching Q such that for every i, the i heaviest
edges in Q are a maximum weight matching in G of i edges. The graph G is a threshold graph
[MP95], and in Section 3.2.5 we provide a polynomial time algorithm that finds a dominant
perfect matching in any threshold graph that has a perfect matching. If there is a solution
for the given instance of makespan at most 1, then the assignment of the large jobs in that
solution corresponds to a perfect matching in G, and thus algorithm A4/3 can apply the
algorithm from Section 3.2.5 and find a dominant perfect matching Q in G.
Algorithm A4/3 then assigns the small jobs (load ≤ 1/3) similarly to algorithms A2 and A3/2,
described in Sections 3.2.1 and 3.2.2, respectively. It greedily assigns jobs to a core until the
core's load exceeds 1. Jobs are assigned in non-increasing order of their cache demand, and
the algorithm goes through the cores in non-decreasing order of the sum of the loads of the
large jobs on each core. Once all the jobs are assigned, the algorithm allocates cache to the
cores according to the cache demand of the most demanding job on each core. Algorithm A4/3
fails if it does not find a dominant perfect matching in G or if the resulting solution uses
more than c cores or more than 3K cache.
Theorem 3.11. If there is a solution that assigns the jobs to c cores with makespan at most 1
and uses K cache, then algorithm A4/3 assigns the jobs to c cores with makespan at most 4/3
and uses at most 3K cache.
Proof. Let Y = (p, S) be a solution of makespan at most 1 that uses c cores and K cache.
Algorithm A4/3 provides a solution with makespan at most 4/3 since it may only exceed a load
of 1 on any core by the load of a single small job.
Algorithm A4/3 uses at most c cores to assign the large jobs because the assignment is based
on a perfect matching of size c in G. The existence of Y implies that the total load of all jobs
is at most c. When A4/3 assigns the small jobs it exceeds a load of 1 on all cores it processes,
except maybe the last one, and therefore we get that A4/3 uses at most c cores.
Let z be the cache partition generated by A4/3. Let Cl be the set of cores whose most
demanding job is a large job and Cs be the set of cores whose most demanding job is a small
job.
Consider any core i ∈ Cl. Let j be the most cache demanding large job assigned to core i.
Job j runs in solution Y on some core S(j). Therefore z(i) = xj ≤ p(S(j)). Since each core in
Y runs at most two large jobs, we get that the total cache allocated by our algorithm to cores
in Cl is at most 2K.
Consider the large jobs assigned to cores according to the dominant perfect matching Q.
Denote by si the load on core i after the large jobs are assigned (and before the small jobs
are assigned) and let ri = 1 − si. W.l.o.g. we assume the cores in A4/3 are indexed such that
r1 ≥ . . . ≥ rc. For every i, ∑_{l=i}^{c} sl is at least as large as this sum in any assignment of
the large jobs of makespan at most 1, because any such assignment defines a perfect matching
in the graph G, and if ∑_{l=i}^{c} sl were larger in some other assignment then Q would not be
a dominant perfect matching in G. Since the total volume of all large jobs is fixed, we get
that for every core i the amount of free volume on cores 1 through i, ∑_{l=1}^{i} rl, is maximal
and cannot be exceeded by any other assignment of the large jobs of makespan at most 1.
W.l.o.g. we assume that the cores in solution Y = (p, S) are indexed such that p(i) ≥ p(i + 1). Let i be any core in C_s. We show that z(i) ≤ p(i). Assume, to the contrary, that z(i) > p(i). Let α be the cache demand of the most demanding small job assigned to core i in solution Y. Let J_1 = {j | a_j ≤ 1/3, x_j ≥ z(i)} and J_2 = {j | a_j ≤ 1/3, x_j > α}. Since α ≤ p(i) < z(i), we get that J_1 ⊆ J_2.
Solution Y assigns all the jobs in J_2 to its first (i − 1) cores, without exceeding a makespan of 1. Therefore the total volume of jobs in J_2 is at most the total available space solution Y
has on its first (i − 1) cores after assigning the large jobs. Since we know that for every i the sum ∑_{l=1}^{i} r_l is maximal and cannot be exceeded by any assignment of the large jobs of makespan at most 1, we get that the total volume of jobs in J_2 is at most ∑_{l=1}^{i} r_l. Algorithm A_{4/3} does not assign all the jobs in J_1 to its first (i − 1) cores, and since A_{4/3} loads each of the first (i − 1) cores with at least 1, we get that the total volume of jobs in J_1 is greater than ∑_{l=1}^{i} r_l. So the total volume of jobs in J_2 is less than the total volume of jobs in J_1, but that is a contradiction to the fact that J_1 ⊆ J_2. Therefore z(i) ≤ p(i) for every i ∈ C_s. It follows that the total cache allocated by our algorithm to cores in C_s is at most K, and this concludes the proof that our algorithm allocates a total of at most 3K cache to all cores.
3.2.4 Approximate optimization algorithms for the single load, minimal cache demand model
We presented approximation algorithms for the decision version of the joint cache partition and job assignment problem in the single load and minimal cache demand model. If there is a solution with makespan m, algorithms A_2, A_{3/2} and A_{4/3} find a solution of makespan 2m, 3m/2 and 4m/3, that uses K, 2K and 3K cache, respectively. We now show how to transform these algorithms into approximate optimization algorithms using a standard binary search technique [LST90].
Lemma 3.12. Given m, K and c, assume there is a polynomial time approximate decision algorithm that, if there is a solution of makespan m, K cache and c cores, returns a solution of makespan αm, βK cache and c cores, where α and β are at least 1. Then there is a polynomial time approximation algorithm that finds a solution of makespan αm_opt, βK cache and c cores, where m_opt is the makespan of the optimal solution with K cache and c cores.
Proof. Let's temporarily assume that the loads of all jobs are integers. This implies that for any cache partition and job assignment the makespan is an integer.
Our approximate optimization algorithm performs a binary search for the optimal makespan and maintains a search range [L, U]. Initially, U = ∑_{j=1}^{n} a_j and L = ⌈U/c⌉. Clearly these initial values of L and U are a lower and an upper bound on the optimal makespan, respectively. Let A be the approximate decision algorithm whose existence is assumed in the lemma's statement.
In each iteration, we run algorithm A with parameters K, c and m = ⌊(L + U)/2⌋. If A succeeds and returns a solution with makespan at most αm, we update the upper bound U := m. If A
fails, we know there is no solution of makespan at most m, and we update the lower bound L := m + 1. It is easy to see that the binary search maintains the invariant that after any iteration, if the search range is [L, U] then m_opt ∈ [L, αU] and we have a solution of makespan at most αU. The binary search stops when L = U.
The makespan of the solution when the binary search stops is at most αU = αL ≤ αm_opt. The binary search stops after O(log_2(∑_{j=1}^{n} a_j)) iterations, and since A runs in polynomial time, our algorithm runs in polynomial time. This shows that our binary search algorithm is a polynomial time α-approximation algorithm.
If the loads in our instance are not integers, let 1/2^φ be the precision in which the loads are given. By multiplying all loads by 2^φ we get an equivalent instance where all the loads of the jobs are integers. Note that this only adds φ iterations to the binary search and our algorithm still runs in polynomial time.
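The binary search in the proof of Lemma 3.12 can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names are ours, and a brute-force exact decision procedure stands in for the approximate decision algorithm A (so α = 1).

```python
import itertools

def binary_search_makespan(loads, c, decide):
    """Binary search of Lemma 3.12: loads are integer job loads, c is the
    number of cores, and decide(m) returns a solution of makespan at most
    alpha*m whenever one of makespan at most m exists, or None otherwise."""
    U = sum(loads)
    L = -(-U // c)  # ceil(U/c): a lower bound on the optimal makespan
    best = decide(U)  # m = U always succeeds (makespan never exceeds the total load)
    while L < U:
        m = (L + U) // 2
        sol = decide(m)
        if sol is not None:
            best, U = sol, m  # a solution of makespan <= alpha*m was found
        else:
            L = m + 1  # no solution of makespan <= m exists
    return best

def make_exact_decider(loads, c):
    """Toy exact decision procedure (alpha = 1) by brute force, for illustration."""
    def decide(m):
        for assign in itertools.product(range(c), repeat=len(loads)):
            per_core = [0] * c
            for job, core in zip(loads, assign):
                per_core[core] += job
            if max(per_core) <= m:
                return assign
        return None
    return decide
```

For example, with loads [4, 3, 3, 2, 2] on c = 2 cores the search returns an assignment of makespan 7, which is optimal here since the total load is 14.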
The following theorem follows immediately from Lemma 3.12.
Theorem 3.13. Using the approximate decision algorithms presented in this section, we obtain polynomial time approximate optimization algorithms for the single load, minimal cache demand problem with approximation factors 2, 3/2 and 4/3, that use K, 2K and 3K cache, respectively.
3.2.5 Dominant perfect matching in threshold graphs
Let G = (V, E) be an undirected graph with 2c vertices, where each vertex x ∈ V has a weight w(x) ≥ 0. The edges of the graph are defined by a threshold t > 0 to be E = {(x, y) | w(x) + w(y) ≤ t, x ≠ y}. Such a graph G is known as a threshold graph [CH73, MP95]. We say that the weight of an edge (x, y) is w(x, y) = w(x) + w(y).
A perfect matching A in G is a subset of the edges such that every vertex in V is incident to exactly one edge in A. Let A_i denote the i-th heaviest edge in A. We assume, w.l.o.g., that there is some arbitrary predefined order of the edges in E that is used, as a secondary sort criterion, to break ties in case several edges have the same weight. In particular, this implies that A_i is uniquely defined.
Definition 3.14. A perfect matching A dominates a perfect matching B if for every x ∈ {1, . . . , c}, ∑_{i=1}^{x} w(A_i) ≥ ∑_{i=1}^{x} w(B_i).
Definition 3.15. A perfect matching A is a dominant matching if A dominates any other perfect matching B.
Let A and B be two perfect matchings in G. We say that A and B share a prefix of length l if A_i = B_i for i ∈ {1, . . . , l}. The following greedy algorithm finds a dominant perfect matching in a threshold graph G that has a perfect matching. We start with G_0 = G. At step i, the algorithm selects the edge (x, y) with maximum weight in the graph G_i. If there are several edges of maximum weight, then (x, y) is the first by the predefined order on E. The graph G_{i+1} is obtained from G_i by removing vertices x, y and all edges incident to x or y. The algorithm stops when it has selected c edges and G_c is empty.
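This greedy procedure can be sketched as follows. The sketch is ours, not the thesis's implementation: vertices are indexed 0, . . . , 2c − 1, and the lexicographic order on vertex pairs stands in for the predefined order on E.

```python
import itertools

def greedy_dominant_matching(weights, t):
    """Greedy dominant perfect matching in a threshold graph: repeatedly pick
    a maximum-weight edge (ties broken by the fixed order on E) and remove
    its endpoints. Returns the selected edges, heaviest first, or None if
    no edge is available while vertices remain."""
    alive = set(range(len(weights)))
    matching = []
    while alive:
        edges = [e for e in itertools.combinations(sorted(alive), 2)
                 if weights[e[0]] + weights[e[1]] <= t]
        if not edges:
            return None  # cannot happen if the threshold graph has a perfect matching
        # maximum weight; among ties, the lexicographically first pair
        x, y = min(edges, key=lambda e: (-(weights[e[0]] + weights[e[1]]), e))
        matching.append((x, y))
        alive -= {x, y}
    return matching

def dominates(A, B, weights):
    """Check Definition 3.14: every prefix sum of A's sorted edge weights
    is at least the corresponding prefix sum of B's."""
    wa = sorted((weights[x] + weights[y] for x, y in A), reverse=True)
    wb = sorted((weights[x] + weights[y] for x, y in B), reverse=True)
    return all(sum(wa[:i]) >= sum(wb[:i]) for i in range(1, len(wa) + 1))
```

For instance, with vertex weights [6, 5, 4, 3] and threshold t = 10, the greedy picks the weight-10 edge (0, 2) and then (1, 3); the resulting matching dominates the only other perfect matching, {(0, 3), (1, 2)}.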
Lemma 3.16. For every x ∈ {0, . . . , c − 1}, if the graph G_x has a perfect matching, then the graph G_{x+1} has a perfect matching.
Proof. Let M_x denote a perfect matching in graph G_x. Let (a, b) be the edge of maximum weight in G_x that we remove, with its vertices and their incident edges, to obtain G_{x+1}. If (a, b) ∈ M_x then clearly M_x \ {(a, b)} is a perfect matching in G_{x+1}. If (a, b) ∉ M_x then, since M_x is a perfect matching of G_x, there are two vertices c and d such that (a, c) and (b, d) are in M_x. The edge (a, b) is the maximum weight edge in G_x and thus w(b) ≥ w(c) and w(a) ≥ w(d). Therefore (c, d) must be an edge in G_x, because w(c) + w(d) ≤ w(a) + w(b) ≤ t, which is the threshold defining the edges in our threshold graph. Let M_{x+1} = M_x \ {(a, c), (b, d)} ∪ {(c, d)}. It is easy to see that M_{x+1} is a perfect matching in graph G_{x+1}.
Theorem 3.17. If G is a threshold graph with 2c vertices that has a perfect matching, then the
greedy algorithm described above finds a dominant perfect matching.
Proof. Lemma 3.16 implies that our greedy algorithm is able to select a set of c edges that is a perfect matching in G. Denote this matching by Q.
Assume, to the contrary, that Q is not a dominant perfect matching in G. Let A be a perfect matching that is not dominated by Q and that shares the longest possible prefix with Q. Let x denote the length of the shared prefix of Q and A. Let G_x denote the graph obtained from G by removing the x edges that are the heaviest in both A and Q, their vertices, and all edges incident to these vertices.
Let (a, b) = Q_{x+1}. Since A and Q share a maximal prefix of length x, A_{x+1} ≠ (a, b). Since (a, b) is of maximum weight in G_x, it follows that (a, b) ∉ A (otherwise, it would have been A_{x+1}). The set of edges {A_{x+1}, . . . , A_c} forms a perfect matching of G_x, so there must be two edges and two indices l_1 > x and l_2 > x such that A_{l_1} = (a, d) and A_{l_2} = (b, c). We assume w.l.o.g. that l_1 < l_2. The edge (a, b) is of maximum weight in G_x and therefore
w(a) ≥ w(c) and w(b) ≥ w(d). It follows that w(c, d) ≤ w(a, b) ≤ t, and therefore (c, d) ∈ G_x. Let A′ = A \ {(a, d), (b, c)} ∪ {(a, b), (c, d)}. Clearly, A′ is a perfect matching in G, A′_{x+1} = (a, b), and therefore A′ shares a prefix of length x + 1 with Q. If A′ dominates A, then since Q does not dominate A, it follows that Q does not dominate A′. Thus A′ is a perfect matching that shares a prefix of length x + 1 with Q and is not dominated by Q. This is a contradiction to the choice of A. We finish the proof by showing that A′ dominates A.
Let l_3 be the index such that A′_{l_3} = (c, d). Since w(b) ≥ w(d), l_3 > l_2. Let ∆(l) = ∑_{i=1}^{l} w(A′_i) − ∑_{i=1}^{l} w(A_i). The matchings A′ and A share a prefix of length x, so for every 1 ≤ l ≤ x, ∆(l) = 0. For x + 1 ≤ l < l_1, ∆(l) = w(a, b) − w(A_l) ≥ 0, since (a, b) is the edge of maximum weight in G_x. For l_1 ≤ l < l_2, ∆(l) = w(a, b) − w(a, d) ≥ 0, also by the maximality of (a, b). For l_2 ≤ l < l_3, ∆(l) = w(A′_l) − w(c) − w(d), which is non-negative because l < l_3 and therefore w(A′_l) ≥ w(A′_{l_3}) = w(c) + w(d). For l ≥ l_3, ∆(l) = 0. This shows that A′ dominates A and concludes our proof that Q is a dominant perfect matching in G.
On dominant perfect matchings in d-uniform hypergraphs
The problem of finding a dominant perfect matching in a d-uniform threshold hypergraph¹ that has a perfect matching is interesting in the context of the single load, minimal cache demand special case of the joint cache partition and job assignment problem. If we can find such a matching, then an algorithm similar to Algorithm A_{4/3} in Section 3.2.3 would give a solution that uses (d + 1)K cache and approximates the makespan up to a factor of (d + 2)/(d + 1).
However, the following example shows that in a 3-uniform threshold hypergraph that has a perfect matching, a dominant perfect matching does not necessarily exist. Let ε > 0 be an arbitrarily small constant. Consider a hypergraph with 12 vertices, 3 vertices of each weight in {1/3, 2/9, 4/9 − ε, ε}. Each triplet of vertices is an edge if the sum of its weights is at most 1. This hypergraph has a perfect matching. In fact, let's consider two perfect matchings in this hypergraph. Matching A consists of the edges (1/3, 1/3, 1/3), (4/9 − ε, 4/9 − ε, ε), (4/9 − ε, 2/9, 2/9) and (2/9, ε, ε). Matching B consists of three edges of the form (1/3, 2/9, 4/9 − ε) and one edge of the form (ε, ε, ε). It is easy to check that A and B are valid perfect matchings in this hypergraph. Any dominant perfect matching in this hypergraph must contain the edge (1/3, 1/3, 1/3) in order to dominate A, since this is the only edge of weight 1 in this hypergraph. The sum of the two heaviest edges
¹A d-uniform threshold hypergraph is defined on a set of vertices, V, each with a non-negative weight w(v). The set of edges, E, contains all the subsets S ⊂ V of size d such that the sum of the weights of the vertices in S is at most some fixed threshold t > 0.
in matching B is 2 − 2ε, and therefore any dominant perfect matching must have an edge of weight at least 1 − 2ε, as otherwise the matching will not dominate matching B. But if the edge (1/3, 1/3, 1/3) is in the dominant matching, then all edges disjoint from (1/3, 1/3, 1/3) have a weight smaller than 1 − 2ε. Thus no dominant perfect matching exists in this hypergraph.
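The claims in this example can also be checked mechanically. The following sketch (ours, with the concrete choice ε = 1/100, exact rational arithmetic, and our own vertex indexing) verifies that A and B are perfect matchings, that (1/3, 1/3, 1/3) is the only weight-1 edge, and that every edge disjoint from the three 1/3-vertices weighs at most 8/9 − ε < 1 − 2ε:

```python
import itertools
from fractions import Fraction as F

eps = F(1, 100)  # a concrete small epsilon for the check
# 12 vertices: indices 0-2 weigh 1/3, 3-5 weigh 2/9, 6-8 weigh 4/9 - eps, 9-11 weigh eps
w = [F(1, 3)] * 3 + [F(2, 9)] * 3 + [F(4, 9) - eps] * 3 + [eps] * 3

def edge_weight(e):
    return sum(w[v] for v in e)

def is_perfect_matching(M):
    covered = sorted(v for e in M for v in e)
    return covered == list(range(12)) and all(edge_weight(e) <= 1 for e in M)

# Matching A: (1/3,1/3,1/3), (4/9-e, 4/9-e, e), (4/9-e, 2/9, 2/9), (2/9, e, e)
A = [(0, 1, 2), (6, 7, 9), (8, 3, 4), (5, 10, 11)]
# Matching B: three edges (1/3, 2/9, 4/9-e) and one edge (e, e, e)
B = [(0, 3, 6), (1, 4, 7), (2, 5, 8), (9, 10, 11)]
assert is_perfect_matching(A) and is_perfect_matching(B)

# (1/3, 1/3, 1/3) is the only triple of weight exactly 1
ones = [e for e in itertools.combinations(range(12), 3) if edge_weight(e) == 1]
assert ones == [(0, 1, 2)]

# B's two heaviest edges sum to 2 - 2*eps ...
top2 = sorted((edge_weight(e) for e in B), reverse=True)[:2]
assert sum(top2) == 2 - 2 * eps

# ... but every edge disjoint from the 1/3-vertices weighs at most 8/9 - eps,
# so a matching containing (1/3,1/3,1/3) cannot dominate B
heaviest_rest = max(edge_weight(e)
                    for e in itertools.combinations(range(3, 12), 3)
                    if edge_weight(e) <= 1)
assert heaviest_rest == F(8, 9) - eps
assert heaviest_rest < 1 - 2 * eps
```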
Matching A in the example above is the perfect matching found by applying the greedy algorithm to this hypergraph. It is interesting to note that in a 3-uniform threshold hypergraph, the greedy algorithm does not necessarily find a perfect matching at all. This is because Lemma 3.16 does not extend to 3-uniform threshold hypergraphs. Let ε > 0 be an arbitrarily small constant. Consider a hypergraph with 9 vertices, 3 vertices of each weight in {1/3, 2/9, 4/9 − ε}. Each triplet of vertices is an edge if the sum of its weights is at most 1. This hypergraph has a perfect matching, since the 3 edges of the form (1/3, 2/9, 4/9 − ε) are a perfect matching in this hypergraph. However, the greedy algorithm first selects the edge (1/3, 1/3, 1/3) and then selects an edge of the form (2/9, 2/9, 4/9 − ε). The remaining hypergraph now contains three vertices and no edges, so the greedy algorithm is stuck and fails to find a perfect matching.
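This failure can be reproduced directly. The sketch below (ours, with ε = 1/100 and exact rationals; ties are broken by taking the first maximum-weight triple in lexicographic order, one valid choice of the predefined order) lifts the greedy of Section 3.2.5 to 3-uniform hypergraphs and shows it getting stuck:

```python
import itertools
from fractions import Fraction as F

eps = F(1, 100)
# 9 vertices: indices 0-2 weigh 1/3, 3-5 weigh 2/9, 6-8 weigh 4/9 - eps
w = [F(1, 3)] * 3 + [F(2, 9)] * 3 + [F(4, 9) - eps] * 3

def greedy_3uniform(w, t=1):
    """Repeatedly remove a maximum-weight 3-edge, as in the greedy algorithm
    of Section 3.2.5 lifted to 3-uniform threshold hypergraphs."""
    alive = set(range(len(w)))
    picked = []
    while alive:
        edges = [e for e in itertools.combinations(sorted(alive), 3)
                 if sum(w[v] for v in e) <= t]
        if not edges:
            return picked, alive  # stuck: uncovered vertices remain, but no edge is left
        picked.append(max(edges, key=lambda e: sum(w[v] for v in e)))
        alive -= set(picked[-1])
    return picked, alive

picked, leftover = greedy_3uniform(w)
# the greedy picks (1/3,1/3,1/3), then a (2/9,2/9,4/9-eps) edge, and gets stuck,
# although (1/3,2/9,4/9-eps) taken three times is a perfect matching
```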
3.2.6 PTAS for jobs with correlative single load and minimal cache demand
The main result in this section is a polynomial time approximation scheme for instances of the
single load minimal cache demand problem, where there is a correlation between the load and
the cache demand of jobs with non-zero cache demand. This special case is motivated by the
observation that often there is some underlying notion of a job’s “hardness” that affects both
its load and its minimal cache demand.
Consider an instance of the single load minimal cache demand problem such that for any two jobs j, j′ with non-zero x_j and x_{j′}, a_j ≤ a_{j′} ⇐⇒ x_j ≤ x_{j′}. We call a job j with x_j > 0 a demanding job and a job j with x_j = 0 a non-demanding job. We consider the following decision problem: decide whether there is a cache partition of K cache to c cores and an assignment of jobs to the cores such that the jobs' minimal cache demands are satisfied and the resulting makespan is at most m. By scaling down the loads of the jobs by m, we assume w.l.o.g. that m = 1.
Let ε > 0. We present an algorithm that, if there is a cache partition and a job assignment with makespan at most 1, returns a cache partition and a job assignment with makespan at most (1 + 2ε). Otherwise, our algorithm either decides that there is no solution of makespan at most 1 or returns a solution of makespan at most (1 + 2ε). Combining this algorithm with a
binary search, we obtain a PTAS.
If there is a job j such that a_j > 1, then our algorithm decides that there is no solution of makespan at most 1. Thus we assume that a_j ≤ 1 for every j.
Let J = J_1 ∪ J_2, where J_1 = {j ∈ J | a_j ≥ ε} and J_2 = J \ J_1. In the first phase, we deal only with jobs in J_1. For each j ∈ J_1 let u_j = max{u ∈ ℕ | ε + uε² ≤ a_j}. We say that ε + u_j ε² is the rounded-down load of job j.
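The rounding can be computed exactly with rational arithmetic; the following helper (the function name is ours) is one way to do it:

```python
from fractions import Fraction as F

def rounded_down_load(a, eps):
    """For a job with load a >= eps, return u_j = max{u in N : eps + u*eps^2 <= a}
    and the rounded-down load eps + u_j*eps^2."""
    a, eps = F(a), F(eps)
    assert a >= eps  # defined only for jobs in J1
    u = (a - eps) // (eps * eps)  # exact floor division on rationals
    return u, eps + u * eps * eps
```

For example, with ε = 1/10 a job of load 0.37 gets u_j = 27 and rounded-down load 0.37, and a job of load 1 gets u_j = 90 and rounded-down load 1; the rounded-down load never undercuts the true load by more than ε².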
Let U_D = {u_j | j ∈ J_1, x_j > 0} and U_{ND} = {u_j | j ∈ J_1, x_j = 0}. An assignment pattern of a core is a table that indicates, for each u ∈ U_D, how many demanding jobs of rounded-down load ε + uε² are assigned to the core and, for each u ∈ U_{ND}, how many non-demanding jobs of rounded-down load ε + uε² are assigned to the core. Note that an assignment pattern of a core does not identify the actual jobs assigned to the core. We only consider assignment patterns whose rounded-down load is at most 1.
A configuration of cores is a table indicating how many cores we have of each possible assignment pattern. A configuration of cores T is valid if, for every u ∈ U_D, the number of demanding jobs in J_1 with u_j = u equals the sum of the numbers of demanding jobs with u_j = u over all assignment patterns in T and, similarly, for every u ∈ U_{ND}, the number of non-demanding jobs in J_1 with u_j = u equals the sum of the numbers of non-demanding jobs with u_j = u over all assignment patterns in T.
The outline of our algorithm is as follows. The algorithm enumerates over all valid configurations of cores. For each valid configuration T, we find an actual assignment of the jobs in J_1 that matches T and minimizes the total cache used. We then proceed to assign the jobs in J_2, in a way that guarantees that if there is a solution of makespan 1 and K cache that matches this configuration of cores, then we obtain a solution of makespan at most (1 + 2ε) and at most K cache. If our algorithm does not generate a solution of makespan at most (1 + 2ε) and at most K cache for any valid configuration of cores, then it decides that no solution of makespan at most 1 exists.
Let T be a valid configuration of cores. For each core i ∈ {1, . . . , c}, let q_i be the maximal rounded-down load of a demanding job assigned to core i according to the assignment pattern of core i in T. Let α_i be the number of demanding jobs of rounded-down load q_i on core i, according to T. We assume w.l.o.g. that the cores are indexed such that q_i ≥ q_{i+1}. Let Q = {q_i | i ∈ {1, . . . , c}}. For each q ∈ Q, let s(q) be the index of the first core i with q_i = q and let e(q) be the index of the last core i with q_i = q. Assume that the cores s(q), . . . , e(q) are indexed such that α_{s(q)} ≥ . . . ≥ α_{e(q)}. Let J_1(q) = {j ∈ J_1 | x_j ≠ 0, ε + u_j ε² = q}, the set of all demanding jobs in J_1 whose rounded-down load is q. Let Y(q) be the set of the ∑_{i=s(q)}^{e(q)} α_i jobs of smallest cache demands in J_1(q).
Our algorithm builds an assignment matching T of minimal cache usage among all assignments matching T. To do so, our algorithm goes over Q in decreasing order and distributes the jobs in Y(q) to the cores s(q), . . . , e(q), in this order of the cores, such that core i ∈ [s(q), e(q)], in turn, gets the α_i most cache demanding jobs in Y(q) that are not yet assigned. After we assign the demanding jobs with the maximal rounded-down load on each core, our algorithm arbitrarily chooses the identity of all other jobs in the configuration T. These are non-demanding jobs and demanding jobs whose rounded-down load is not the maximal rounded-down load on their core. Each core is allocated cache according to the cache demand of the most cache demanding job that is assigned to it.
The algorithm continues with the jobs in J_2. It first assigns the demanding jobs in J_2, in the following greedy manner. Order these jobs from the most cache demanding to the least cache demanding. For each core, we consider two load values: its actual load, which is the sum of the actual loads of the jobs in J_1 assigned to the core, and its rounded-down load, which is the sum of the rounded-down loads of the jobs in J_1 assigned to the core. We order the cores such that first we have all the cores that already had some cache allocated to them in the previous phase of the algorithm, in an arbitrary order. Following these cores, we order the cores with no cache allocated to them from the least loaded core to the most loaded core, according to their rounded-down loads. These cores are either empty or have only non-demanding jobs from J_1 assigned to them. The algorithm assigns the jobs to the cores in these orders (of the jobs and of the cores), and stops adding more jobs to a core and moves to the next one when the core's actual load exceeds 1 + ε. After all these jobs are assigned, the algorithm adjusts the cache allocation of the cores whose most cache demanding job is now a job of J_2.
Finally, it assigns the non-demanding jobs in J_2. Each such job is assigned arbitrarily to a core whose actual load does not already exceed 1 + ε.
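The greedy phase for the demanding jobs of J_2 described above can be sketched as follows. The core and job representations and field names are ours, not from the thesis; the non-demanding jobs of J_2 would be handled afterwards as described.

```python
def assign_demanding_J2(cores, jobs, eps):
    """Greedily assign the demanding jobs of J2 (sketch of Section 3.2.6).

    cores: list of dicts with keys 'actual' (actual J1 load), 'rounded'
    (rounded-down J1 load) and 'cache' (cache allocated in the first phase).
    jobs: list of (load, cache_demand) pairs with cache_demand > 0.
    Mutates cores; returns True on success, False if it runs out of cores.
    """
    # cached cores first (arbitrary order), then cache-free cores by
    # non-decreasing rounded-down load
    order = ([i for i, c in enumerate(cores) if c['cache'] > 0] +
             sorted((i for i, c in enumerate(cores) if c['cache'] == 0),
                    key=lambda i: cores[i]['rounded']))
    k = 0
    for load, demand in sorted(jobs, key=lambda j: -j[1]):  # most demanding first
        # move to the next core once the current one exceeds an actual load of 1 + eps
        while k < len(order) and cores[order[k]]['actual'] > 1 + eps:
            k += 1
        if k == len(order):
            return False
        core = cores[order[k]]
        core['actual'] += load
        core['cache'] = max(core['cache'], demand)  # adjust the cache allocation
    return True
```

For example, with one already-cached core of actual load 1.2 and one empty core, ε = 0.25, and jobs (0.1, 5) and (0.1, 3), the first job lands on the cached core (whose allocation, 10 here, is unchanged) and the second opens the empty core with cache 3.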
Lemma 3.18. The number of valid configurations of cores is O(c^{O(1)}).
Proof. We first consider the number of assignment patterns with rounded-down load at most 1. Since a_j ≤ 1 for each job j, the size of U_D and the size of U_{ND} are at most ⌊(1 − ε)/ε²⌋ = O(1/ε²) = O(1). In an assignment pattern of load at most 1, there are at most 1/ε jobs in J_1 assigned to each core, and thus the number of possible assignment patterns is at most O((1/ε)^{1/ε²}) = O(1). Since the number of assignment patterns we consider is O(1), it follows that the number of possible configurations of cores is O(c^{O(1)}).
Since our algorithm spends polynomial time per configuration of cores, Lemma 3.18 implies that our algorithm runs in polynomial time.
Lemma 3.19. For any configuration of cores T there is an assignment matching T, of minimal cache usage among all assignments matching T, that for each q ∈ Q assigns the ∑_{i=s(q)}^{e(q)} α_i least cache demanding jobs in J_1(q) (i.e., the set of jobs Y(q)) to the cores s(q), . . . , e(q).
Proof. Consider a job assignment S of minimal cache usage that matches T. Assume that for some q ∈ Q assignment S does not assign all the jobs in Y(q) to the cores s(q), . . . , e(q). Then there is a core i ∈ [s(q), e(q)] that runs a job j ∈ J_1(q) \ Y(q).
Since S assigns ∑_{i=s(q)}^{e(q)} α_i jobs from J_1(q) to cores s(q), . . . , e(q), and since jobs in J_1(q) cannot be assigned to cores i′ > e(q), it follows that there is a core i′ < s(q) and a job j′ ∈ Y(q) such that S(j′) = i′. Suppose we switch the assignment of jobs j and j′, running job j on core i′ and job j′ on core i. Let S′ denote the resulting assignment. The cache required by core i′ does not increase, as it runs demanding jobs of rounded-down load greater than q and therefore of cache demand greater than the cache demand of job j. By the choice of the jobs j and j′ we know that x_{j′} ≤ x_j, and therefore the cache required by core i in S′ can only decrease compared to the cache required by core i in S. It follows that the cache usage of S′ is at most that of S, and since S is of minimal cache usage among all assignments that match T, the cache usage of S′ must be the same as that of S.
By repeating this argument as long as some job violates the property in the statement of the lemma, we obtain an assignment as required.
Lemma 3.20. For any configuration of cores T, let S be an assignment matching T such that for each q ∈ Q and for each core i ∈ [s(q), e(q)], if we index the jobs in Y(q) from the most cache demanding to the least cache demanding, assignment S assigns to core i the jobs in Y(q) with indices ∑_{j=s(q)}^{i−1} α_j + 1, . . . , ∑_{j=s(q)}^{i} α_j. Assignment S is of minimal cache usage among all assignments matching T.
Proof. Assume, to the contrary, that assignment S is not of minimal cache usage among all assignments matching T. Let S′ be an assignment whose existence is guaranteed by Lemma 3.19. Since S and S′ have different cache usages, there exists q ∈ Q such that S and S′ differ on their assignment of the jobs in Y(q). We index the jobs in Y(q) from the most cache demanding to the least cache demanding. Let j ∈ Y(q) be the first (most cache demanding) job in Y(q) such that S(j) ≠ S′(j). We select S′ such that it maximizes j among all assignments satisfying Lemma 3.19 that disagree with S on the assignment of the jobs in Y(q).
Denote i = S(j) and i′ = S′(j). Since S and S′ both assign α_i jobs from Y(q) to core i, and since j is the first job in Y(q) on which S and S′ disagree, there is a job j_2 ∈ Y(q), j_2 > j, such that S′(j_2) = i.
We first assume that there is a job j_1 < j such that S(j_1) = i. Let S′′ be the assignment such that S′′(j) = i, S′′(j_2) = i′, and for any job h ∉ {j, j_2}, S′′(h) = S′(h). The cache required by core i′ in S′′ is at most the cache required by core i′ in S′, since j < j_2. Since j_1 < j and S(j_1) = i, we know that S′(j_1) = i and also S′′(j_1) = i. This implies that in S′′, core i requires the same amount of cache as in S′. It follows that S′′ is also an assignment of minimal cache usage, and that it satisfies Lemma 3.19. Since S′′(j) = S(j), we get a contradiction to the way we selected S′. Thus S is of minimal cache usage among all assignments matching T.
We now assume that j is the first job in Y(q) such that S(j) = i. Let S′′ be the following assignment. Any job that is assigned by S′ to a core different from i and i′ is assigned by S′′ to the same core. For any job x such that S′(x) = i′, S′′(x) = i. All the α_{i′} least cache demanding jobs assigned by S′ to core i are assigned by S′′ to core i′. Note that α_i ≥ α_{i′}, and therefore assignment S′′ is well defined.
Since S and S′ agree on the assignment of the jobs in Y(q) that precede j and assign them to cores l < i, job j is the most cache demanding job assigned to cores l ≥ i by S′ and by S′′. Therefore in assignment S′, core i′ requires x_j cache, and in assignment S′′, core i requires x_j cache. In assignment S′′, core i′ is assigned a set of jobs that is a subset of the jobs assigned to core i by S′. Thus the cache required by core i′ in assignment S′′ is at most the cache required by core i in assignment S′. It follows that S′′ is also an assignment of minimal cache usage, and that it satisfies Lemma 3.19. This contradicts the choice of S′ and concludes the proof that assignment S is of minimal cache usage among all assignments matching T.
Corollary 3.21. For each configuration of cores T our algorithm builds an actual assignment
of minimal cache usage of the jobs in J1 that matches T .
Proof. The assignment returned by our algorithm is an assignment S as in the statement of Lemma 3.20.
Lemma 3.22. Consider an instance of the correlative single load minimal cache demand problem. If there is a cache partition and job assignment that schedules the jobs on c cores, uses at most K cache and has a makespan of at most 1, then our algorithm finds a cache partition and job assignment that schedules the jobs on c cores, uses at most K cache and has a makespan of at most (1 + 2ε).
Proof. Let A be a solution of makespan at most 1 with c cores and K cache, whose existence is assumed by the lemma. Let T_A be the configuration of the cores corresponding to the assignment of the jobs in J_1 by solution A, and assume our algorithm currently considers T_A in its enumeration.
We show that our algorithm succeeds in assigning all the jobs to c cores. Let's assume to the contrary that it fails. It can only fail if all cores are assigned an actual load of more than (1 + ε) and there are still remaining jobs to assign. This indicates that the total volume to assign is larger than c(1 + ε), which contradicts the fact that assignment A is able to assign the jobs to c cores with makespan at most 1.
Let S denote the assignment of all jobs on c cores that our algorithm returns when it considers T_A. We know that S matches T_A for jobs in J_1. We now show that in S each core has an actual load of at most 1 + 2ε. When we restrict S to J_1, we know that the rounded-down load on each core is at most 1 and that each core has at most 1/ε jobs from J_1 assigned to it. Since the actual load of any job in J_1 is at most ε² larger than its rounded-down load, we get that if we restrict assignment S to J_1, the actual load on each core is at most 1 + (1/ε)·ε² = 1 + ε. The way our algorithm assigns the jobs in J_2 implies that the actual load of a core in assignment S can exceed 1 + ε only by the load of a single job from J_2. Therefore the actual load on any core in assignment S is at most 1 + 2ε.
We show that assignment S uses at most K cache. Cache is allocated by our algorithm in two steps: when it decides on the actual assignment of the jobs in J_1 that matches T_A, and when it assigns the demanding jobs in J_2. Corollary 3.21 shows that S restricted to J_1 is of minimal cache usage among all assignments matching T_A, and thus uses at most the same amount of cache as assignment A restricted to J_1.
We show that when we also take into account the demanding jobs in J_2, S uses at most the same amount of cache as A. Assume the cores in S are indexed according to the order in which our algorithm assigns demanding jobs from J_2 to them. Assume the cores in A are indexed such that core i in S and core i in A have the same assignment pattern. For any core in S, we say that its free space is (1 + ε) minus the sum of the actual loads of all jobs in J_1 assigned to it by S. For any core in A, we say that its free space is 1 minus the sum of the actual loads of all jobs in J_1 assigned to it by A. For any i, core i in S has the same rounded-down load as core i in A, and the actual load of core i in S is at most ε larger than the actual load of core i in A. Therefore, by the definition of free space, the free space of core i in solution S is at least the free space of core i in solution A.
Let i_2 be the number of cores in S that have a demanding job from J_1 assigned to them. When our algorithm assigns jobs in J_2 to a core i ≤ i_2, it does not increase the cache required by core i, since any demanding job in J_1 is at least as cache demanding as any demanding job in J_2. It follows that the total cache required by cores 1, . . . , i_2 in S is at most the total cache required by cores 1, . . . , i_2 in A.
Let i > i_2 be a core in S whose cache demand is determined by a job from J_2. We now show that core i in S requires no more cache than core i in A. This will conclude the proof that S uses at most K cache.
The total load of demanding jobs in J_2 that S assigns to cores 1, . . . , i − 1 is at least the sum of the free space of these cores, since our algorithm exceeds an actual load of 1 + ε on each core before moving to the next. The sum of the free space of cores 1, . . . , i − 1 in S is at least the sum of the free space of the cores 1, . . . , i − 1 in A, which in turn is an upper bound on the total load of demanding jobs from J_2 that are assigned in A to cores 1, . . . , i − 1. Since our algorithm assigns the demanding jobs in J_2 in non-increasing order of their cache demand, we get that the cache demand of the most cache demanding job from J_2 on core i in S is at most the cache demand of the most cache demanding job from J_2 on core i in A.
Lemma 3.22 shows that for any ε′ > 0, we have a polynomial time (1 + 2ε′)-approximate decision algorithm. Given ε > 0, by applying our algorithm with ε′ = ε/2 we obtain a polynomial time (1 + ε)-approximate decision algorithm.
By using a binary search similar to the one in Lemma 3.12, we obtain a (1 + ε)-approximation for the optimization problem, using our (1 + ε)-approximate decision algorithm. To conclude, we have proven the following theorem.
Theorem 3.23. There is a polynomial time approximation scheme for the joint cache partition and job assignment problem, when the jobs have a correlative single load and minimal cache demand.
3.3 Step functions with a constant number of load types
Empirical studies [Dre07] suggest that the load of a job, as a function of available cache, is
often similar to a step-function. The load of the job drops at a few places where the cache size
exceeds the working-set required by some critical part. In between these critical cache sizes the
load of the job decreases negligibly with additional cache. The problems we consider in this
section are motivated by this observation.
Formally, each job j ∈ J is described by two load values l_j < h_j and a cache demand x_j ∈ {0, . . . , K}. If job j runs on a core with at least x_j cache then it takes l_j time, and otherwise it takes h_j time. If a job is assigned to a core that meets its cache demand x_j, we say that it is assigned as a small job. If it is assigned to a core that doesn't meet its cache demand, we say that it is assigned as a large job. At first we study the case where the number of different load types is constant, and then we show a polynomial time scheduling algorithm for the corresponding special case of the ordered unrelated machines scheduling problem.
Let L = lj | j ∈ J and H = hj | j ∈ J, the sets of small and large loads, respectively.
Here we assume that |L| and |H| are both bounded by a constant.
For each α ∈ L, β ∈ H, we say that job j is of small type α if lj = α and we say that job
j is of large type β if hj = β. If job j is of small type α and large type β we say that it is of
load type (α, β). Note that jobs j1, j2 of the same load type may have different cache demands
xj1 ≠ xj2 and thus the number of different job types is Ω(K) and not O(1).
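The step-function model above is easy to make concrete. The following sketch (illustrative names, not code from the thesis) encodes step-function jobs and evaluates the makespan of a static cache partition and job assignment:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepJob:
    """A job whose load is a step function of its cache allocation:
    it runs in time `low` on a core with at least `demand` cache,
    and in time `high` otherwise (low < high)."""
    low: float
    high: float
    demand: int

def load(job, cache):
    """Load of `job` on a core holding `cache` units of cache."""
    return job.low if cache >= job.demand else job.high

def makespan(jobs, partition, assignment):
    """Makespan of a static solution: `partition[i]` is the cache given
    to core i, `assignment[j]` is the core that job j runs on."""
    cores = [0.0] * len(partition)
    for j, core in enumerate(assignment):
        cores[core] += load(jobs[j], partition[core])
    return max(cores)

jobs = [StepJob(1, 4, 2), StepJob(1, 4, 2), StepJob(2, 3, 1)]
# Two cores sharing K = 3 cache: give 2 to core 0 and 1 to core 1.
print(makespan(jobs, [2, 1], [0, 0, 1]))  # prints 2.0: every demand is met
```

The third job is assigned as a small job here; moving all the cache to core 1 would turn the first two jobs into large jobs and raise the makespan to 8.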
We reduce this problem to the single load minimal cache demand problem studied in Section
3.2. For each load type (α, β), we enumerate on the number, x(α, β), of the jobs of load type
(α, β) that are assigned as large jobs. For each setting of the values x(α, β) for all load types,
we create an instance of the single load minimal cache demand problem in which each job
corresponds to a job in our original instance. For each job j which is one of the x(α, β) most
cache demanding jobs of load type (α, β) we create a job of load β and cache demand 0. For
each job j of load type (α, β) which is not one of the x(α, β) most cache demanding jobs of
this load type, we create a job of load α and cache demand xj . We solve each of the resulting
instances using any algorithm for the single load minimal cache demand problem presented in
Section 3.2, and choose the solution with the minimal makespan. We transform this solution
back to a solution of the original instance, by replacing each job with its corresponding job in
the original instance. Note that this does not affect the makespan or the cache usage.
Lemma 3.24. Given a polynomial time α-approximation algorithm for the single load minimal
cache demand problem that uses at most βK cache, the reduction described above gives a poly-
nomial time α-approximation algorithm for the problem where job loads are step functions with
a constant number of load types, that uses at most βK cache.
Proof. Consider an instance of the joint cache partition and job assignment problem with load
functions that are step functions with a constant number of load types. Assume there is a
solution A for this instance of makespan m that uses at most K cache. Let x(α, β) be the
number of jobs of load type (α, β) that are assigned in A as large jobs. W.l.o.g we can assume
that for each (α, β), the x(α, β) jobs that are assigned as large jobs are the x(α, β) most
cache demanding jobs of load type (α, β). The existence of A implies that when our algorithm
considers the same values for x(α, β), for each (α, β), it generates an instance of the single
load cache demand problem that has a solution of makespan at most m and at most K cache.
Applying the α-approximation algorithm for the single load minimal cache demand problem,
whose existence is assumed by the lemma, on this instance yields a solution of makespan at
most αm that uses at most βK cache. This solution is transformed to a solution of our original
instance without affecting the makespan or the cache usage.
Our algorithm runs in polynomial time since the size of the enumeration is O(n^{|L||H|}).
Corollary 3.25. For instances in which the load functions are step functions with a constant
number of load types there are polynomial time approximation algorithms that approximate the
makespan up to a factor of 2, 3/2 and 4/3, and use at most K, 2K and 3K cache, respectively.
3.3.1 The corresponding special case of ordered unrelated machines
Recall that if we fix the cache partition in an instance of the joint cache partition and job
assignment problem then we obtain an instance of the ordered unrelated machines scheduling
problem. For the case where the load functions are step functions with a constant number of
load types, the resulting ordered unrelated machines instance can be solved in polynomial time
using the dynamic programming algorithm described below. The dynamic program follows a
structure similar to the one used in [EL11], where polynomial time approximation schemes are
obtained for several variants of scheduling with restricted processing sets.
In this special case of the ordered unrelated scheduling problem job j runs in time lj on
some prefix of the machines, and in time hj on the suffix (we assume that the machines are
ordered in non-increasing order of their strength/cache allocation). For simplicity, we assume
xj is given as the index of the first machine on which job j has load hj . If job j takes the same
amount of time to run regardless of cache, we set xj = c + 1 and its load on any machine
is lj . As before, we assume that L = {lj | j ∈ J} and H = {hj | j ∈ J} are of constant size.
We design a polynomial time algorithm that finds a job assignment that minimizes the
makespan. The algorithm does a binary search for the optimal makespan, as in Section 3.2.4,
using an algorithm for the following decision problem: Is there an assignment of the jobs J to
the c machines with makespan at most M? By scaling the loads, we assume that M = 1.
For every machine m, we define Sm = {j ∈ J | xj = m + 1}, the set of all jobs that are large
on machine m + 1 and small on any machine i ≤ m. Let Sm(α, β) = {j ∈ Sm | lj = α, hj = β}
and bm(α, β) = |Sm(α, β)|. It is convenient to think of bm as a vector in {0, . . . , n}^{L×H}.
Let a ∈ {0, . . . , n}^{L×H}, δ ∈ {0, . . . , n}^{H} and let m be any machine. Let J(m, a) be a set of
jobs which contains all the jobs in ⋃_{i=1}^{m} Si together with an additional a(α, β) jobs of load type
(α, β) from ⋃_{i=m+1}^{c} Si, for each load type (α, β). Let πm(a, δ) be 1 if we can schedule all the jobs
in J(m, a), except for δ(β) jobs of each large load type β, on the first m machines. Note that
since the additional jobs specified by a are small on all machines 1, . . . ,m, πm(a, δ) does not
depend on the additional jobs’ identity. Our original decision problem has a solution if and only
if πc(~0,~0) = 1.
Consider the decision problem π1(a, δ). We want to decide if it is possible to schedule the
jobs in J(1, a), except for δ(β) jobs of each large load type β, on machine 1. To decide this,
our algorithm chooses the δ(β) jobs of each large job type β that have the largest small loads
and removes them from J(1, a). If the sum of the small loads of the remaining jobs is at most
1, then π1(a, δ) = 1, and otherwise π1(a, δ) = 0.
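The base case is a simple greedy check. A sketch (illustrative, not code from the thesis):

```python
def pi_1(jobs, delta, makespan=1.0):
    """Base case of the dynamic program: can machine 1 run the jobs of
    J(1, a), except for delta[beta] jobs of each large load type beta?

    `jobs` lists the (low, high) loads of the jobs in J(1, a); all of
    them are small on machine 1, so each contributes its low load.
    """
    remaining = list(jobs)
    for beta, count in delta.items():
        # Skip the `count` jobs of large type beta with the largest small loads.
        typed = sorted((j for j in remaining if j[1] == beta), reverse=True)
        for j in typed[:count]:
            remaining.remove(j)
    return sum(low for low, _ in remaining) <= makespan

print(pi_1([(0.5, 2), (0.4, 2), (0.3, 3)], {2: 1}))  # skips (0.5, 2); 0.7 <= 1
```

Removing the jobs with the largest small loads minimizes the load left on machine 1, which is exactly why the greedy choice is safe.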
To solve πm(a, δ) we enumerate, for each load type (α, β), on ξ(α, β), the number of jobs in
J(m, a) of this load type that are assigned as small jobs to machine m. Note that these jobs
are either in Sm(α, β) or in the additional set of a(α, β) jobs of type (α, β). For each β ∈ H, we
enumerate on the number λ(β) of jobs in J(m, a) of large load type β that are assigned as large
jobs to machine m. The following lemma is the basis for our dynamic programming scheme.
Its proof is straightforward.
Lemma 3.26. We can schedule all the jobs in J(m, a) except for δ(β) jobs of large load type
β (for each β ∈ H) on machines 1, . . . ,m with makespan at most 1 such that ξ(α, β) jobs of
load type (α, β) are assigned to machine m as small jobs and λ(β) jobs of large load type β are
assigned to machine m as large jobs if and only if the following conditions hold:
• For each (α, β) ∈ L ×H, ξ(α, β) ≤ a(α, β) + bm(α, β): The number of jobs of each load
type that we assign as small jobs to machine m is at most the number of jobs in J(m, a)
of this load type that are small on machine m.
• ∑_{β∈H} λ(β)·β + ∑_{(α,β)∈L×H} ξ(α, β)·α ≤ 1: The total load of the jobs assigned to machine m
is at most 1.
• Let a′ = a + bm − ξ and δ′ = δ + λ; then πm−1(a′, δ′) = 1. The jobs in J(m − 1, a′), except
for δ′(β) jobs of large load β for each β ∈ H, can be scheduled on machines 1, . . . ,m− 1
with makespan at most 1.
The algorithm for solving πm(a, δ) sets πm(a, δ) = 1 if it finds λ and ξ such that the
conditions in Lemma 3.26 are met. If no λ and ξ meet the conditions, then πm(a, δ) = 0.
Our dynamic program solves πm(a, δ) in increasing order of m from 1 to c and returns the
result of πc(~0,~0). The correctness of the dynamic program follows from Lemma 3.26 and from
the fact that for m = 1, our algorithm chooses the jobs that it does not assign to machine 1
such that the remaining load on machine 1 is minimized. Therefore we set π1(a, δ) = 1 if and
only if there is a solution of makespan at most 1.
By adding backtracking links, our algorithm can also construct a schedule with makespan at
most 1. We maintain links between each πm(a, δ) that is 1 to a corresponding πm−1(a′, δ′) that
is also 1, according to the last condition in Lemma 3.26. Tracing back the links from πc(~0,~0)
gives us an assignment with makespan at most 1 as follows. Consider a link between πm(a, δ)
and πm−1(a′, δ′). This defines λ = δ′ − δ and ξ = a + bm − a′. For each (α, β) we assign to
machine m ξ(α, β) arbitrary jobs of load type (α, β) from ⋃_{i=m}^{c} Si that we have not assigned
already, and we reserve λ(β) slots of load β on machine m to be populated with jobs later. Our
algorithm guarantees that the load on machine m is at most 1. When we reach π1(a, δ), for
some a and δ, in the backtracking phase, we have δ(β) slots of size β allocated on machines
2, . . . , c. The δ(β) jobs of large load β with the largest small loads in J(1, a) are assigned to
these slots. Note that these jobs may be large on their machine and have a load of β or they
may be small and have a load smaller than β. In any case, the resulting assignment assigns all
the jobs in J and has a makespan of at most 1.
The number of problems πm(a, δ) that our dynamic program solves is O(c·n^{|L||H|}) = O(c·n^{O(1)}).
To solve each problem, we check the conditions in Lemma 3.26 for O(n^{|L||H|}) possible λ's and
ξ's. This takes O(1) per λ and ξ since we already computed πm−1(a′, δ′) for every a′ and δ′.
Thus the total complexity of this algorithm is polynomial. This concludes the proof of the
following theorem.
Theorem 3.27. Our dynamic programming algorithm is a polynomial-time exact optimization
algorithm for the special case of the ordered unrelated machines scheduling problem, where each
job j has load lj on some prefix of the machines, and load hj ≥ lj on the corresponding suffix.
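The whole recurrence fits in a short memoized procedure. The sketch below (illustrative, not code from the thesis) implements πm(a, δ) directly; it assumes xj ≥ 2 for every job and is practical only for tiny instances, since the state space is polynomial only because |L| and |H| are constants:

```python
from functools import lru_cache
from itertools import product

def schedulable(jobs, c, makespan=1.0):
    """Decide the ordered-machines instance via the pi_m(a, delta) recurrence.

    Each job is (low, high, x): load `low` on machines 1..x-1 and `high`
    on machines x..c (machines ordered by non-increasing cache).
    """
    types = sorted({(lo, hi) for lo, hi, _ in jobs})   # load types (alpha, beta)
    highs = sorted({hi for _, hi, _ in jobs})          # large load types beta
    n = len(jobs)

    def b(m):  # b_m: per-type count of S_m = {j : x_j = m + 1}
        return tuple(sum(1 for lo, hi, x in jobs if x == m + 1 and (lo, hi) == t)
                     for t in types)

    @lru_cache(maxsize=None)
    def pi(m, a, delta):
        avail = tuple(ai + bi for ai, bi in zip(a, b(m)))
        if m == 1:
            # All jobs of J(1, a) are small on machine 1; skip the delta(beta)
            # jobs of each large type with the largest small loads.
            pool = [t for t, k in zip(types, avail) for _ in range(k)]
            for beta, d in zip(highs, delta):
                for t in sorted((t for t in pool if t[1] == beta), reverse=True)[:d]:
                    pool.remove(t)
            return sum(lo for lo, _ in pool) <= makespan
        for xi in product(*(range(v + 1) for v in avail)):           # small on m
            for lam in product(*(range(n + 1 - d) for d in delta)):  # large on m
                load = sum(x * t[0] for x, t in zip(xi, types)) + \
                       sum(l * beta for l, beta in zip(lam, highs))
                if load <= makespan and pi(m - 1,
                                           tuple(v - x for v, x in zip(avail, xi)),
                                           tuple(d + l for d, l in zip(delta, lam))):
                    return True
        return False

    return pi(c, (0,) * len(types), (0,) * len(highs))

# Jobs that are small (load 0.5) only on machine 1 and large (load 2) elsewhere:
jobs = [(0.5, 2.0, 2)] * 3
print(schedulable(jobs[:2], c=2), schedulable(jobs, c=2))  # True False
```

Two such jobs fit on machine 1 within makespan 1, but a third would have to run as a large job somewhere, so the decision flips to False.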
3.4 Joint dynamic cache partition and job scheduling
We consider a generalization of the joint cache partition and job assignment problem that allows
for dynamic cache partitions and dynamic job assignments. We define the generalized problem
as follows. As before, J denotes the set of jobs, there are c cores and a total cache of size K.
Each job j ∈ J is described by a non-increasing function Tj(x).
A dynamic cache partition p = p(t, i) indicates the amount of cache allocated to core i at
time unit t (to simplify the presentation we assume that time is discrete). For each time unit t,
∑_{i=1}^{c} p(t, i) ≤ K. A dynamic assignment S = S(t, i) indicates
for each core i and time unit t, the index of the job that runs on core i at time t. If no job
runs on core i at time t then S(t, i) = −1. If S(t, i) = j ≠ −1 then for any other core i′ ≠ i,
S(t, i′) ≠ j. Each job has to perform 1 work unit. If job j runs for α time units on a core with x
cache, then it completes α/Tj(x) work. A partition and schedule p, S are valid if all jobs complete
their work. Formally, p, S are valid if for each job j, ∑_{⟨t,i⟩∈S⁻¹(j)} 1/Tj(p(t, i)) = 1. The load of core
i is defined as the maximum t such that S(t, i) ≠ −1. The makespan of p, S is defined as the
maximum load on any core. The goal is to find a valid dynamic cache partition and dynamic
job assignment with a minimal makespan.
It is easy to verify that dynamic cache partition and dynamic job assignment, as defined
above, generalize the static partition and static job assignment. The partition is static if for
every fixed core i, p(t, i) is constant with respect to t. The schedule is a static assignment if for
every job j, there are times t1 < t2 and a core i such that S⁻¹(j) = {⟨t, i⟩ | t1 ≤ t ≤ t2}.
We consider four variants of the joint cache partition and job assignment problem: the static
partition and static assignment variant studied so far, the variant in which the cache
partition is dynamic and the job assignment is static, the variant in which the job assignment
is dynamic and the cache partition is static, and the variant in which both are dynamic.
Note that in the variant where the cache partition is dynamic but the job assignment is
static we still have to specify for each core, in which time units it runs each job that is assigned
to this core. That is, we have to specify a function S(t, i) for each core i. This is due to the fact
that different schedules that have the same set of jobs assigned to a particular core, when the
cache partition is dynamic, may have different loads, since jobs may run with different cache
allocations. When the cache partition is also static, the different schedules that have the same
set of jobs on a particular core have the same load, and it suffices to specify which jobs are
assigned to which core.
We study the makespan improvement that can be gained by allowing a dynamic solution. We
show that allowing a dynamic partition and a dynamic assignment can improve the makespan
by a factor of at most c, the number of cores. We also show an instance where by using a
dynamic partition and a static assignment we achieve an improvement factor arbitrarily close
to c. We show that allowing a dynamic assignment of the jobs, while keeping the cache partition
static, improves the makespan by at most a factor of 2, and that there is an instance where an
improvement of 2 − 2/c is achieved, for c ≥ 2.
Given an instance of the joint cache partition and job assignment problem, we denote by OSS
the optimal static cache partition and static job assignment, by ODS the optimal dynamic cache
partition and static job assignment, by OSD the optimal static cache partition and dynamic job
schedule and by ODD the optimal dynamic cache partition and dynamic job schedule. For any
solution A we denote its makespan by M(A).
Lemma 3.28. For any instance of the joint cache partition and job assignment problem,
M(OSS) ≤ cM(ODD).
Proof. Let A be the trivial static partition and schedule, that assigns all jobs to the first core
and allocates all the cache to this core. Let’s consider any job j that takes a total of α time
to run in the solution ODD. Whenever a fraction of job j runs on some core with some cache
partition, it has at most K cache available to it. Therefore, in solution A, when we run job j
continuously on one core with K cache, it takes at most α time. Since the total running time of
all the jobs in solution ODD is at most cM(ODD), we get M(OSS) ≤M(A) ≤ cM(ODD).
Corollary 3.29. For any instance of the joint cache partition and job assignment problem,
M(OSS) ≤ cM(ODS).
Proof. Clearly, M(ODS) ≥ M(ODD) for any instance. Combining this with Lemma 3.28, we
get that M(OSS) ≤ cM(ODS).
Lemma 3.30. For any ε > 0 there is an instance of the joint cache partition and job assignment
problem, such that M(OSS) > (c− ε)M(ODS).
Proof. Let b be an arbitrary constant. Consider the following instance with two types of
jobs. There are c jobs of type 1, such that for each such job j, Tj(x) = ∞ for x < K and
Tj(K) = 1. There are bc jobs of type 2, such that for each such job j, Tj(x) = bc if x < K/c and
Tj(x) = 1 if x ≥ K/c.
Consider the following solution. The static job assignment runs b jobs of type 2 on each
core. After b time units, it runs the c jobs of type 1 on core 1. The dynamic cache partition
starts with each core getting K/c cache. The cache partition changes after b time units and core
1 gets all the cache. This solution has a makespan of b+ c and therefore M(ODS) ≤ b+ c.
There is an optimal static cache partition and static job assignment that allocates to each
core 0, K/c or K cache, because otherwise we can reduce the amount of cache allocated to a core
without changing the makespan of the solution. This implies that there are only two static cache
partitions that may be used by the optimal static solution: the partition in which p(i) = K/c for
each core i, and the partition that gives all the cache to a single core. It is easy to see that if
we use the cache partition where p(i) = K/c we get a solution with an infinite makespan because
of the jobs of type 1. Therefore this optimal static solution uses a cache partition that gives
all the cache to a single core. Given this partition, the optimal job assignment is to run all the
c jobs of type 1 on the core with all the cache, and assign to that core additional bc − (c − 1)
jobs of type 2. So the load on that core is bc + 1. Each of the c − 1 cores with no cache is
assigned exactly one job of type 2, and each such core has a load of bc. Therefore the ratio
M(OSS)/M(ODS) ≥ (bc + 1)/(b + c). The lower bound on this ratio approaches c as b
approaches infinity. Since b is
an arbitrarily chosen constant, we can choose it large enough such that we get a lower bound
that is greater than c− ε, for any ε > 0.
Corollary 3.31. For any ε > 0 there is an instance of the joint cache partition and job assign-
ment problem, such that M(OSS) > (c− ε)M(ODD).
Proof. Consider the same instance as in the proof of Lemma 3.30. For that instance, M(OSS) >
(c − ε)M(ODS). It follows that M(OSS) > (c − ε)M(ODD) for the instance in Lemma 3.30,
since M(ODS) ≥ M(ODD).
Lemma 3.32. For any instance of the joint cache partition and job assignment problem,
M(OSS) ≤ 2M(OSD).
Proof. Consider any instance of the joint cache partition and job assignment problem and let
OSD = (p, S). Let xij be the fraction of job j's work unit that is carried out by core i. Formally,
xij = |{t | (t, i) ∈ S⁻¹(j)}| / Tj(p(i)). Consider the instance of scheduling on unrelated machines where
job j runs on core i in time Tj(p(i)). Since for every job j, ∑_{i=1}^{c} xij = 1, xij is a fractional
assignment for that instance of the unrelated machines scheduling problem. The makespan of
this fractional solution is M(OSD). Let y be the optimal fractional assignment of the defined
instance of unrelated machines. We know that if we apply Lenstra’s algorithm [LST90] to this
instance, we get an integral assignment, denoted by z, such that the makespan of z is at most
twice the makespan of y and therefore at most twice the makespan of x. Assignment z is a
static job assignment and therefore (p, z) is a solution to the joint static cache partition and
static job assignment problem of our original instance, with makespan at most twice M(OSD).
It follows that M(OSS) ≤ 2M(OSD).
Lemma 3.33. For c ≥ 2, there is an instance of the joint cache partition and job assignment
problem such that M(OSS)/M(OSD) = 2 − 2/c.
Proof. Consider the following instance. There are c jobs, where each takes 1 − 1/c time regardless
of the cache allocation, and one job that takes 1 time unit, regardless of cache. The optimal
static schedule for this instance assigns two jobs of size 1 − 1/c to the first core, assigns one job
of size 1 − 1/c to each of the cores 2, . . . , c − 1, and assigns the unit sized job to the last core.
This yields a makespan of 2 − 2/c. The optimal dynamic assignment assigns one job of size
1 − 1/c fully to each core, and then splits the unit job equally among the cores, to yield a makespan of
1. Notice that this can be scheduled such that the unit job never runs simultaneously on more
than one core. This is achieved by running the ith fraction, of size 1/c, of the unit job on core i at
time (i − 1)/c. The other jobs, that are fully assigned to a single core, are paused and resumed later,
if necessary, to accommodate the fractions of the unit sized job. Therefore in this instance the
ratio M(OSS)/M(OSD) is exactly 2 − 2/c.
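The interleaving used in this proof can be checked mechanically. The sketch below (illustrative, not from the thesis) verifies that the unit job's fractions never overlap in time and that the ratio equals 2 − 2/c:

```python
from fractions import Fraction

def unit_job_intervals(c):
    """On core i (0-indexed) the unit job occupies [i/c, (i+1)/c); the
    size-(1 - 1/c) job fills the rest of the interval [0, 1)."""
    return [(Fraction(i, c), Fraction(i + 1, c)) for i in range(c)]

def ratio(c):
    """M(OSS)/M(OSD) for the Lemma 3.33 instance."""
    intervals = unit_job_intervals(c)
    # The unit job never runs on two cores at once ...
    assert all(e1 <= s2 for (_, e1), (s2, _) in zip(intervals, intervals[1:]))
    # ... and its fractions add up to one full unit of work.
    assert sum(e - s for s, e in intervals) == 1
    return 2 - Fraction(2, c)  # the dynamic makespan is exactly 1

print(ratio(4))  # 3/2
```

Exact rational arithmetic avoids any floating-point doubt about the interval endpoints.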
Chapter 4

Static partitions under bijective analysis
We consider a model where c cores share a cache of size K. Each core has a single job to perform
that is defined by a sequence of page requests. We assume that the sequences of all cores are
of the same length. Core i requests pages from a working set Wi of possible pages. We assume
that for any two cores i and j, Wi and Wj are disjoint. The cache is statically partitioned
between the cores according to a cache partition p. Core i has p(i) cache allocated to it which
is populated by a subset of Wi of size p(i). Core i serves its page requests according to their
order in the core’s sequence. For each page request by core i, if the requested page is currently
in the cache then core i has a cache hit and otherwise it has a cache miss. When core i has
a cache miss on page x, it can replace one of its currently cached pages by x. We denote by E
an eviction policy that decides for each core and each cache miss, which cached page, if any, to
replace. We study this problem in an online setting where the request sequences are not known
in advance. An algorithm for the problem is a pair (p,E) where p is a static cache partition and
E is an eviction policy. The goal is to find an algorithm that minimizes the maximum number
of cache misses for any core.
Bijective analysis [ADLO07] is a technique to directly compare online algorithms. Bijective
analysis defines that online algorithm A is at least as good as online algorithm B if there is a
bijection π that maps any input s to an input π(s) of the same length, such that for any input
s, A(π(s)) ≤ B(s).
Theorem 4.3, which gives the main result of this section, shows that if |Wi| = |Wj| for any i, j
and if the set of possible inputs of length n is ∏_{i=1}^{c} Wi^n, the set of all possible combinations of c
sequences of length n such that the pages in sequence i are in Wi, then any online algorithm that
partitions the cache equally among the cores is at least as good as any other online algorithm,
regardless of the eviction policies.
Bijective analysis was first introduced in [ADLO07] in the context of online algorithms for
the single-core paging problem. Bijective analysis can differentiate between different algorithms
that may appear to be equal under competitive analysis [ST85]. Angelopoulos and Schweitzer
[AS09] used bijective analysis to prove the optimality of Least-Recently-Used as a page eviction
policy for single core caching, assuming locality of reference. In contrast, the competitive ratio
of LRU is equivalent to a wider class of marking algorithms [KMRS88], some of which have
a poor performance in practice. This optimality result is strongly suggested by experimental
work on caching but is not captured by competitive analysis.
Definition 4.1. Let A and B be two online algorithms for a minimization problem and let In
denote all possible inputs of size n. Algorithm A is at least as good as algorithm B, on inputs of
size n, if there is a bijection π : In → In such that for any s ∈ In, A(π(s)) ≤ B(s). We denote
this by A ≼n B.
Directly showing the bijection, as required by Definition 4.1, can be difficult in various
scenarios. Theorem 4.2 provides a technique to show the existence of such bijections using
stochastic dominance. Stochastic dominance was previously used to measure performance of
online algorithms in [HV08] and applied to the online bin coloring problem. The following result
shows the equivalence of bijective analysis and stochastic dominance analysis.
Theorem 4.2. Let A and B be two algorithms for a minimization problem. Let In be the set
of all inputs of size n. Then A ≼n B if and only if for any x, Pr(A(s) ≤ x) ≥ Pr(B(s) ≤ x),
assuming s is uniformly sampled from In.
Proof. Assume A ≼n B. There is a bijection π such that for any s ∈ In, A(π(s)) ≤ B(s). For
any algorithm Alg, let IAlg(x) = {s ∈ In | Alg(s) ≤ x}.
For any s ∈ IB(x), A(π(s)) ≤ B(s) ≤ x and therefore π(s) ∈ IA(x). This shows that
π(IB(x)) ⊆ IA(x) and therefore |IA(x)| ≥ |π(IB(x))| = |IB(x)|. Assuming s is uniformly
sampled from In we get Pr(A(s) ≤ x) = |IA(x)|/|In| ≥ |IB(x)|/|In| = Pr(B(s) ≤ x).
For the other direction, assume that for every x, Pr(A(s) ≤ x) ≥ Pr(B(s) ≤ x), assuming
s is uniformly sampled from In. This implies that |IA(x)| ≥ |IB(x)|, for every x.
Let c1 < c2 < . . . < cb be the possible cost values returned by algorithm B for inputs
in In. We build a bijection π from In to In as follows. We start by selecting an arbitrary
injection π1 from IB(c1) to IA(c1). Let Q1 = IA(c1) \ π1(IB(c1)) be the set of unmatched
elements in IA(c1). In the ith step, 2 ≤ i ≤ b, we select an arbitrary injection πi of the
elements in IB(ci) \ IB(ci−1) to elements in Qi−1 ∪ (IA(ci) \ IA(ci−1)). We then set
Qi = IA(ci) \ ⋃_{j=1}^{i} πj(IB(cj) \ IB(cj−1)), the set of the unmatched elements in IA(ci), where
we let IB(c0) = ∅.
Let π be the bijection from In to In such that π(x) = πi(x) for any x such that B(x) = ci.
It is easy to see that this is a well defined bijection, if we can find injections πi for any i. For
any i, we show that the size of the domain of πi is at most the size of the range allowed by this
algorithm. For i = 1, we know that |IB(c1)| ≤ |IA(c1)| because of the stochastic dominance
assumption. For i ≥ 2, the size of the domain of πi is |IB(ci) \ IB(ci−1)| = |IB(ci)|−|IB(ci−1)| ≤
|IA(ci)|− |IA(ci−1)|+ |IA(ci−1)|− |IB(ci−1)| = |IA(ci) \ IA(ci−1)|+ |Qi−1| which is the size of the
allowed range. Therefore this algorithm is able to injectively match, in each step i, the elements
in IB(ci) \ IB(ci−1) to elements in Qi−1 ∪ (IA(ci) \ IA(ci−1)).
Let s ∈ In and let i be such that B(s) = ci. By π’s construction, we know that π(s) ∈ IA(ci)
and therefore A(π(s)) ≤ ci = B(s). The existence of π implies that A ≼n B.
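On a finite input set, the level-by-level construction in this proof can be realized in a single pass: matching the k-th cheapest input under B to the k-th cheapest input under A yields a valid π whenever stochastic dominance holds, because the k-th smallest A-cost is then at most the k-th smallest B-cost. A sketch (illustrative, not code from the thesis):

```python
def build_bijection(inputs, cost_a, cost_b):
    """Return a bijection pi with cost_a(pi[s]) <= cost_b(s) for all s,
    assuming |{s : cost_a(s) <= x}| >= |{s : cost_b(s) <= x}| for every x."""
    by_b = sorted(inputs, key=cost_b)  # inputs in increasing cost under B
    by_a = sorted(inputs, key=cost_a)  # inputs in increasing cost under A
    pi = dict(zip(by_b, by_a))         # k-th cheapest under B -> k-th under A
    assert all(cost_a(pi[s]) <= cost_b(s) for s in inputs)
    return pi

# A's costs (0, 0, 1, 1) stochastically dominate B's costs (0, 1, 2, 3).
pi = build_bijection(range(4), cost_a=lambda s: s // 2, cost_b=lambda s: s)
```

The sorted matching collapses the injections π1, . . . , πb of the proof into one permutation, since each level of B-cost is matched into inputs of no larger A-cost.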
We now prove our main result, under the assumption that for any core i, |Wi| = mK for some integer m.
Theorem 4.3. Let p′ be the cache partition that allocates cache equally to the cores. Let p be
any other cache partition. Let E and E′ be any two page eviction policies. Then for any n,
(p′, E′) n (p,E).
Proof. We assume w.l.o.g that for each i, p(i) is the fraction of the cache of size K allocated to
core i and therefore ∑_{i=1}^{c} p(i) = 1.
Let In be the set of all c request sequences, each of length n, and assume we uniformly
sample an input from In. This implies that each page request, for each core, is uniformly and
independently selected from Wi. Thus for any cache partition p and any eviction policy E, the
probability of any page request of core i to be a cache hit is p(i)/m. Therefore the number of cache
misses of core i is a binomial random variable, Xi ∼ Binom(n, 1 − p(i)/m), with success probability
1 − p(i)/m and n trials. Let X = maxi Xi. For any cache partition p and for any 0 ≤ b ≤ n, let
Fp(b) = Pr(X ≤ b | p) denote the cumulative distribution function of X at point b.
For b = n, Fp(n) = 1 for any p, and in particular Fp(b) is maximized by p = p′. For b = 0,
Fp(0) = ∏_{i=1}^{c} (p(i)/m)^n = (1/m^{cn}) (∏_{i=1}^{c} p(i))^n, and it is easy to see (for example
by the AM–GM inequality) that Fp(0) is maximized by p = p′.
The set of all p's such that ∑_{i=1}^{c} p(i) = 1 is closed and bounded and Fp(b) is a continuous function
of p, and thus the maximum of Fp(b) is attained by some cache partition. We show that for any
b ∈ {1, . . . , n − 1}, any partition p ≠ p′ does not maximize Fp(b). It then follows that Fp(b) is
maximized by p = p′. By Theorem 4.2, this implies the optimality of any algorithm that uses
cache partition p′, regardless of the eviction policy.
Let p ≠ p′ be a cache partition. We assume that for every i, p(i) ≠ 0, as otherwise X = n
and any cache partition with non-zero cache allocations is better than p.
We can represent the cumulative distribution function of Xi, for any i, as a regularized
incomplete beta function:

Pr(Xi ≤ b | p(i)) = (n − b) (n choose b) ∫_0^{p(i)/m} t^{n−b−1} (1 − t)^b dt
And therefore:

Fp(b) = Pr(X ≤ b | p) = ∏_{l=1}^{c} Pr(Xl ≤ b | p(l)) = (n choose b)^c (n − b)^c ∏_{l=1}^{c} ∫_0^{p(l)/m} t^{n−b−1} (1 − t)^b dt     (4.1)
Since p ≠ p′ there are i, j such that p(i) < p(j). To simplify notation we denote
α = p(i) + p(j) and q = p(i), so that p(j) = α − q. We know that q < α/2. We are interested in how Fp(b)
changes if we increase q, and keep the cache allocated to any core other than i, j fixed. There
are two terms in the product in Equation 4.1 that depend on q, the term for core i and the term
for core j, and thus if we denote C = (n choose b)^c (n − b)^c ∏_{l∉{i,j}} ∫_0^{p(l)/m} t^{n−b−1} (1 − t)^b dt, which is independent
of q, we get that:
∂F/∂q = C [ (1/m) (q/m)^{n−b−1} (1 − q/m)^b ∫_0^{(α−q)/m} t^{n−b−1} (1 − t)^b dt
− (1/m) ((α − q)/m)^{n−b−1} (1 − (α − q)/m)^b ∫_0^{q/m} t^{n−b−1} (1 − t)^b dt ]
We show that this derivative is positive, which implies that we can increase Fp(b) by
increasing q. Since b < n, C is a positive constant, and clearly 1/m is positive. By ignoring both
constants and substituting x = qt/(α − q) in the first integral we get:

(q/m)^{n−b−1} (1 − q/m)^b ∫_0^{q/m} (x(α − q)/q)^{n−b−1} (1 − x(α − q)/q)^b ((α − q)/q) dx
− ((α − q)/m)^{n−b−1} (1 − (α − q)/m)^b ∫_0^{q/m} t^{n−b−1} (1 − t)^b dt
Rearranging we get:

((α − q)/q) ((α − q)/m)^{n−b−1} (1 − q/m)^b ∫_0^{q/m} x^{n−b−1} (1 − x(α − q)/q)^b dx
− ((α − q)/m)^{n−b−1} (1 − (α − q)/m)^b ∫_0^{q/m} t^{n−b−1} (1 − t)^b dt
Dividing both operands by the positive value ((α − q)/m)^{n−b−1}, we get that it suffices to prove
that:

((α − q)/q) (1 − q/m)^b ∫_0^{q/m} x^{n−b−1} (1 − x(α − q)/q)^b dx > (1 − (α − q)/m)^b ∫_0^{q/m} t^{n−b−1} (1 − t)^b dt     (4.2)
Since the integration boundaries are the same for both sides of the inequality, it suffices to
prove

((α − q)/q) (1 − q/m)^b x^{n−b−1} (1 − x(α − q)/q)^b ≥ (1 − (α − q)/m)^b x^{n−b−1} (1 − x)^b

for any x ∈ [0, q/m], with strict inequality for x ∈ (0, q/m). Since (α − q)/q > 1, it suffices to prove

(1 − q/m)^b x^{n−b−1} (1 − x(α − q)/q)^b ≥ (1 − (α − q)/m)^b x^{n−b−1} (1 − x)^b     (4.3)
If x = 0 we clearly get an equality. Dividing the inequality by x^{n−b−1} and taking the b-th
root of both sides, we get that it is sufficient to show that:

(1 − q/m) (1 − x(α − q)/q) ≥ (1 − (α − q)/m) (1 − x)     (4.4)
which is equivalent to:

x (α − 2q)/q ≤ (α − 2q)/m

This last inequality holds strictly for any x < q/m and holds with equality for x = q/m.
Therefore, any partition p ≠ p′ does not maximize Fp(b). This concludes the proof that
the cache partition p′ maximizes Fp(b) for any b and thus proves the theorem.
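For small parameters, Theorem 4.3 can also be corroborated numerically by evaluating Fp(b) directly from the binomial distribution of the miss counts (a sketch under the theorem's model; not code from the thesis):

```python
from math import comb

def miss_cdf(b, p_frac, m, n):
    """P(X_i <= b), where X_i ~ Binom(n, 1 - p_frac/m) counts the misses
    of a core holding a fraction p_frac of the cache and |W_i| = mK."""
    q = 1 - p_frac / m  # per-request miss probability
    return sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in range(b + 1))

def F(partition, b, m, n):
    """Pr(max_i X_i <= b): the product over independent cores (Eq. 4.1)."""
    out = 1.0
    for p in partition:
        out *= miss_cdf(b, p, m, n)
    return out

# Two cores, |W_i| = 4K, sequences of length 10: on a grid of splits,
# the equal split maximizes F for every b in {1, ..., n - 1}.
m, n = 4, 10
grid = [(i / 10, 1 - i / 10) for i in range(1, 10)]
assert all(max(grid, key=lambda p: F(p, b, m, n)) == (0.5, 0.5)
           for b in range(1, n))
```

Summing the binomial probabilities directly avoids any need for a numerical beta-function routine here; the regularized incomplete beta representation is what the proof uses for differentiation.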
A natural extension of this result is to consider cores of arbitrary working set sizes, that
is, the case where |Wi| and |Wj| may be different for different cores i ≠ j. An intuitive
conjecture is that the partition p(i) = |Wi| / ∑_{j=1}^{c} |Wj| is the optimal cache partition under bijective
analysis.
This conjecture turns out to be false. Fixing the problem’s parameters n,K and the |Wi|’s,
we can directly compute Pr(X ≤ b | p) for any b and p by using the representation of the
cumulative distribution function of each binomial random variable as a regularized incomplete
beta function. By computationally going over all 0 ≤ b ≤ n and all cache partitions, we found
that for different b’s, Pr(X ≤ b | p) is maximized by a different partition pb. This means that no
static cache partition stochastically dominates all other partitions, and thus there is no optimal
static partition under bijective analysis when cores have working sets of different sizes.
For example, consider the case of c = 3, K = 30, |W1| = 40, |W2| = 60, |W3| = 80 and the
length of the request sequences n = 100. Figure 4.1 shows that no single partition maximizes
the cumulative distribution function for all values of b.
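The computation described here is easy to reproduce. In the sketch below (illustrative, not code from the thesis) each request is uniform over the core's working set, matching the model in the proof of Theorem 4.3:

```python
from math import comb
from itertools import product

def miss_cdf(b, pages, wset, n):
    """P(#misses <= b) for a core holding `pages` cache pages, working
    set of size `wset`, over n uniform independent page requests."""
    q = 1 - pages / wset  # per-request miss probability
    return sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in range(b + 1))

def best_partition(K, wsets, b, n):
    """Static split of K pages maximizing Pr(max_i misses_i <= b),
    found by exhaustive enumeration of all splits."""
    best, best_val = None, -1.0
    for head in product(range(K + 1), repeat=len(wsets) - 1):
        if sum(head) > K:
            continue
        p = head + (K - sum(head),)
        val = 1.0
        for pages, w in zip(p, wsets):
            val *= miss_cdf(b, pages, w, n)
        if val > best_val:
            best, best_val = p, val
    return best

# The thesis example: c = 3, K = 30, |W| = (40, 60, 80), n = 100.
for b in (60, 75, 90):
    print(b, best_partition(30, (40, 60, 80), b, 100))
```

Comparing the printed maximizers for different values of b reproduces the observation above that no single static partition dominates for all b.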
Looking at several similar figures for different parameters gives rise to the following conjec-
ture.
Conjecture 4.4. Let pb be the cache partition that maximizes Pr(X ≤ b | pb). Then for any i,
pb(i) is between 1/c and |Wi| / ∑_{j=1}^{c} |Wj|.
Notice that this conjecture is a generalization of Theorem 4.3. The proof of Theorem 4.3
is based on showing that for any partition that does not allocate the cache equally among the
cores, there is a pair p(i) and p(j) such that by changing them while maintaining their sum
fixed we can increase Pr(X ≤ b | p) for any b.
One may try to prove Conjecture 4.4 by a similar approach. That is, by showing that we
can improve a partition by locally changing p(i) and p(j) for some two cores i, j such that at
least one of them violates the condition in the conjecture. Unfortunately, we can show that this
approach fails.

Figure 4.1: No single partition maximizes the cumulative distribution function for all values of b. Consider c = 3, K = 30, |W1| = 40, |W2| = 60, |W3| = 80, and request sequences of length n = 100. For each value of b, the three lines indicate the cache partition that maximizes Pr(X ≤ b | p): the red line is p(3), the green line is p(2), and the blue line is p(1).

Assume, for example, that K = 30, c = 3, n = 100, |W1| = 50, |W2| = 40
and |W3| = 10. Let p be the partition p(1) = 0.42, p(2) = 0.41 and p(3) = 0.17. It is easy to see
that p does not meet the criteria of Conjecture 4.4. Specifically, p(2) is too large and it is the
only cache allocation in p that is out of the range specified by Conjecture 4.4. If we consider
the pair p(2) and p(3), fix their sum and consider the derivative of Pr(X ≤ b | p) as a function
of p(2) we get that for b ≥ 85 the derivative is positive. Similarly, if we consider the pair p(2)
and p(1), we get that the derivative of Pr(X ≤ b | p) as a function of p(2) is positive for b ≤ 52.
This leads to two possible future research directions for proving Conjecture 4.4:

1. Assuming a partition p does not meet the criteria of the conjecture, can we show that for any
b there is a pair p(i(b)) and p(j(b)) such that, while keeping p(i(b)) + p(j(b)) fixed, we can
increase Pr(X ≤ b | p) and reach a partition where both p(i(b)) and p(j(b)) are closer to the
required ranges?

2. Can we find a more global way to modify the partition p such that, in some metric, it gets
closer to satisfying Conjecture 4.4 while Pr(X ≤ b | p) increases, for any b?
Chapter 5

Cache partitions in the speed-aware model
As in the previous chapter, we assume that each core has a single job to perform. However,
rather than specifying the job by a request sequence, we specify it by a speed function
vi(a) that indicates the speed at which core i progresses through its job if it is allocated a cache
pages. We assume that for any core i, vi(a) is a non-decreasing function of a. Our goal is to
find the cache partition for which the speed of the slowest core is maximized.
To motivate the assumption that the cores’ speed functions are known, we consider the
following two scenarios:
• Consider an offline multi-core caching problem where each core has a sequence of page
requests to serve. Assume that a cache hit takes 1 time unit and a cache miss takes τ > 1
time units. Assume also that each core statically populates its allocated cache of size a
with the optimal subset of pages, that is the set of the a most frequent pages in the core’s
sequence. Let fi(a) denote the sum of the frequencies of these a most frequent pages for
core i. Then the speed is vi(a) = (1/τ)(1 − fi(a)) + fi(a). Note that given the request sequences
we can compute vi(a) for any i and a.
• Speeds are also available in a probabilistic model, where the pages requested by each
core are drawn independently from a given distribution on its working set. Let qi(x) be
the probability that core i requests page x. Assume, w.l.o.g., that the pages of core i
are indexed such that qi(x) is a non-increasing function of the page x. It is clear that if core
i has a cache pages allocated to it, the optimal pages to place in the cache are the a
pages with the highest probabilities according to qi, even if it can dynamically change the
contents of its cache. If core i is allocated a pages in the cache then its expected speed is

vi(a) = (1/τ) ∑_{x=a+1}^{wi} qi(x) + ∑_{x=1}^{a} qi(x).
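In this probabilistic model the speed function can be computed directly from the request distribution. A small Python sketch (τ and the distribution are arbitrary illustrative choices):

```python
def speed(q, a, tau):
    """Expected speed of a core whose page-request probabilities q are
    sorted in non-increasing order, when it caches the a most probable
    pages; a hit runs at speed 1, a miss at speed 1/tau."""
    return sum(q[:a]) + sum(q[a:]) / tau

# Two equally likely pages, tau = 2:
q = [0.5, 0.5]
print(speed(q, 1, 2.0))  # 0.75  (= 1/2 + 1/(2*tau))
print(speed(q, 2, 2.0))  # 1.0
```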
5.1 Finding the optimal static partition
The following greedy algorithm finds the optimal static partition.
Algorithm 1 Greedy Static Partition

for all i = 1, . . . , c do
    ai = 0                          ▷ start cores with no cache at all
end for
for i = 1, . . . , K do
    j = arg minj vj(aj)             ▷ find the currently slowest core
    aj = aj + 1                     ▷ give it an additional cache page
end for
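Algorithm 1 translates directly into code. The sketch below (names are ours) runs it on the two-core, two-page instance used later in Section 5.2, with the speed function of the probabilistic model and an arbitrary τ:

```python
def greedy_static_partition(v, K):
    """Run the greedy algorithm: repeatedly give one more cache page to
    the currently slowest core. v is a list of speed functions v[i](a)."""
    c = len(v)
    a = [0] * c
    for _ in range(K):
        j = min(range(c), key=lambda i: v[i](a[i]))  # currently slowest core
        a[j] += 1                                    # one more cache page
    return a

tau = 2.0

def vi(a):
    # speed of a core with two pages, each requested with probability 1/2
    hit = min(a, 2) * 0.5
    miss = max(2 - a, 0) * 0.5
    return hit + miss / tau

alloc = greedy_static_partition([vi, vi], K=3)
print(sorted(alloc), min(vi(x) for x in alloc))  # [1, 2] 0.75
```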
Lemma 5.1. Algorithm 1 generates the optimal static cache partition.
Proof. Let p be the cache partition generated by Algorithm 1. Assume to the contrary that
there is a partition ψ such that the slowest core under partition ψ is faster than the slowest core
under partition p. Let i be the index of the slowest core under partition p. By the selection
of ψ we know that vi(p(i)) < vi(ψ(i)). Since vi is non-decreasing this implies that ψ(i) > p(i).
Since both partitions allocate a total of K cache we know that there is another core j such that
ψ(j) < p(j). Consider the iteration in which Algorithm 1 gives core j its (ψ(j) + 1)-st cache
page. Core j's speed at the beginning of that iteration is vj(ψ(j)) and it is at most the speed
of core i, at that iteration. Since in Algorithm 1 the speeds of the cores only increase as more
cache pages are allocated, it follows that vi(p(i)) ≥ vj(ψ(j)). This contradicts the selection of
ψ.
5.2 Variable cache partitioning
In this section we consider a variable cache partition. Let β = β(K, c) = (K−1 choose c−1) denote the
number of possible cache partitions. Let p1, . . . , pβ be all possible cache partitions. A variable
cache partition is a distribution x1, . . . , xβ over p1, . . . , pβ. We are given the cores' speed functions
v1, . . . , vc and our goal is to find a variable cache partition that maximizes the expected speed
of the slowest core. We assume that for any core, allocating more cache has a marginally
decreasing benefit. Formally, we assume that for any i, vi is non-decreasing and concave. Note
that in both scenarios described in the beginning of Chapter 5, the vi’s are non-decreasing and
concave.
The following example shows that the expected speed of the slowest core in a variable
cache partition can in fact be better than the speed of the slowest core in the optimal static
cache partition. We consider a probabilistic model which, as we previously established, is a
special case of the speed-aware model. Assume there are 2 cores and a cache of size 3. Each
core accesses one of two pages, each with probability 1/2. In this case the speed functions are
v1(1) = v2(1) = 1/2 + 1/(2τ), and v1(2) = v2(2) = v1(3) = v2(3) = 1. The optimal static partition gives 2
pages to one core and 1 page to the other core. The speed of the slowest core under this
partition is 1/2 + 1/(2τ). Let j denote the index of the above mentioned static partition and let j′
denote the index of the symmetric cache partition (switch the core that gets 2 pages with the one
that gets 1 page). Then the optimal variable cache partition is xj = xj′ = 1/2 and xl = 0 for any
l ∉ {j, j′}. Under this variable cache partition, both cores have an expected speed of 3/4 + 1/(4τ),
which is better than the speed of the slowest core in the optimal static partition.
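The numbers in this example are easy to verify; a short sketch, with an arbitrary τ > 1:

```python
tau = 2.0
slow = 0.5 + 0.5 / tau             # speed with 1 page: 1/2 + 1/(2*tau)
# Optimal static partition: one core gets 2 pages (speed 1), the other 1 page.
static = min(1.0, slow)
# Variable partition: each core gets 2 pages half the time and 1 page half the time.
variable = 0.5 * 1.0 + 0.5 * slow  # = 3/4 + 1/(4*tau)
print(static, variable)  # 0.75 0.875
assert variable == 0.75 + 0.25 / tau
assert variable > static           # holds for any tau > 1
```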
Let pj denote the jth cache partition. Each cache partition pj corresponds to a vector
of cores’ speeds v1(pj(1)), . . . , vc(pj(c)). To simplify notation, we denote vi,j = vi(pj(i)), and
the vector of speeds is v1,j , . . . , vc,j . The optimal variable cache partition is a solution of the
following linear program:
Maximize λ subject to:

    ∑_{j=1}^{β} vi,j xj ≥ λ    for all 1 ≤ i ≤ c
    ∑_{j=1}^{β} xj = 1
    λ ≥ 0, and xj ≥ 0 for all 1 ≤ j ≤ β
The first c constraints ensure that the expected speed of each core is at least λ and the last
constraint ensures that the variable cache partition is a distribution. The number of variables in
this linear program is the number of cache partitions β = Θ(K^c), and the number of constraints
is c + 1.
Consider the dual linear program:
Minimize z subject to:

    ∑_{i=1}^{c} vi,j yi ≤ z    for all 1 ≤ j ≤ β
    ∑_{i=1}^{c} yi = 1
    z ≥ 0, and yi ≥ 0 for all 1 ≤ i ≤ c
In the dual linear program the variable yi represents the weight of core i. We define the
weighted speed of cache partition pj, given core weights y1, . . . , yc, as ∑_{i=1}^{c} vi,j yi, which is the
weighted average of the speeds of the cores under partition pj. A solution of the dual program
is a set of weights y1, . . . , yc that minimize the weighted speed of the fastest cache partition.
This linear program has c+1 variables and an exponential number of inequalities. In [GLS81]
Grotschel et al. proposed a technique to optimize a linear program with an exponential number
of inequalities in polynomial time, independent of the number of inequalities. The technique is
based on Khachiyan’s ellipsoid method [Kha79] that does not require the explicit inequalities to
be available but instead to have a separation oracle for the solution set of the linear program. A
thorough review of linear programming, including the ellipsoid and separation oracle techniques,
can be found in [Sch98].
Definition 5.2. A separation oracle for a linear program of n variables is an algorithm that
given x ∈ Rn decides if x is a feasible solution of the linear program or finds a constraint that
is violated by x.
The following theorem is proved in [GLS81]:
Theorem 5.3. Let Q = {x | Ax ≤ b} ⊆ R^n be the solution set of a system of linear inequalities and let d ∈ R^n
be the optimization direction, i.e., we maximize d^T x for x ∈ Q. Let T be the maximal bit-encoding
length of any value in A and b. Let S be the maximal bit-encoding length of any value in d.
Then:
Given a separation oracle for Q that for every x ∈ R^n runs in time ψ, which is polynomial in n
and T, and returns a violated constraint of encoding length polynomial in n and T, the ellipsoid
method can optimize the linear program in time polynomial in ψ, n, T and S.
A separation oracle for our dual linear program is an algorithm that given a point in Rc+1,
which is composed of y ∈ Rc and z ∈ R, determines if any inequality is violated by y, z and if
so returns one such violated inequality. A violated inequality corresponds to a cache partition
whose weighted speed with respect to y is greater than z. Our separation oracle checks the last
constraint explicitly and if it holds it then computes the fastest cache partition. If the weighted
speed of this partition is at most z then y, z is a feasible solution to the dual linear program.
Otherwise, the fastest cache partition defines a violated inequality that our separation oracle
returns.
Our separation oracle computes the fastest cache partition as follows. Let ∆vi(x + 1) =
vi(x+ 1)− vi(x) be the speed increase for core i if it is given an additional cache page on top of
the x pages already allocated to it. We start with all the cores having no cache, that is, p(i) = 0
for any i ∈ {1, . . . , c}. We find the core i = arg max_{1≤i≤c} yi ∆vi(p(i) + 1) and give it another cache
page, p(i) := p(i) + 1. We repeat this step K times.
After all the cache pages are allocated to the cores, we check if the weighted speed of
the resulting cache partition, ∑_{i=1}^{c} yi vi(p(i)), is greater than z and if so, we return the inequality
corresponding to this partition as a violated inequality. Otherwise, our separation oracle decides
that the given point is a feasible solution of the dual linear program.
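The oracle's weighted greedy can be sketched as follows; the speed functions and weights below are illustrative choices, and a brute-force search over all partitions confirms the greedy's answer on this small instance:

```python
from itertools import product

def fastest_partition(v, y, K):
    """Greedily build the cache partition maximizing the weighted speed
    sum_i y[i]*v[i](p[i]); correct when each v[i] is non-decreasing and concave."""
    c = len(v)
    p = [0] * c
    for _ in range(K):
        # core whose next page yields the largest weighted marginal gain
        j = max(range(c), key=lambda i: y[i] * (v[i](p[i] + 1) - v[i](p[i])))
        p[j] += 1
    return p

tau = 2.0

def make_v(q):
    # concave speed function of the probabilistic model (q sorted non-increasing)
    return lambda a: sum(q[:a]) + sum(q[a:]) / tau

v = [make_v([0.7, 0.2, 0.1]), make_v([0.5, 0.5])]
y = [0.5, 0.5]
K = 3

def weighted_speed(p):
    return sum(yi * vi(a) for yi, vi, a in zip(y, v, p))

p = fastest_partition(v, y, K)
best = max((pp for pp in product(range(K + 1), repeat=len(v)) if sum(pp) == K),
           key=weighted_speed)
assert abs(weighted_speed(p) - weighted_speed(best)) < 1e-12
```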
Theorem 5.4. The cache partition generated by the above process is the fastest cache partition,
with respect to core weights y.
Proof. Let p be the cache partition generated by the above process. Assume to the contrary
that it is not the fastest with respect to y. Let ψ be a cache partition of the fastest weighted
speed with respect to y. We further select ψ among all partitions of the fastest weighted speed
as one that minimizes ∑_{i=1}^{c} |ψ(i) − p(i)|, the sum of absolute distances from p.
Let i be the first index such that for any i′ < i, ψ(i′) = p(i′) and ψ(i) ≠ p(i). We assume
w.l.o.g that p(i) < ψ(i) (otherwise we can rename the cores). Since both partitions allocate a
total of K pages, there is a core j > i such that p(j) > ψ(j).
Consider the iteration in which our algorithm decides to give core j its (ψ(j) + 1)-st page. Let
φ be the cache partition our algorithm has before this iteration. We know that φ(j) = ψ(j) and
p(i) ≥ φ(i). Since vi is a concave non-decreasing function we get that ∆vi is a non-increasing
function. Thus ∆vi(φ(i)) ≥ ∆vi(p(i)) ≥ ∆vi(ψ(i)). Since our algorithm decides in this iteration
to give the next cache page to core j we know that yj∆vj(ψ(j) + 1) ≥ yi∆vi(φ(i) + 1) ≥
yi∆vi(p(i) + 1) ≥ yi∆vi(ψ(i)). This inequality implies that if we take partition ψ and move one
page from core i to core j, it will not decrease the weighted speed of the partition, as moving
this page changes the weighted speed by yj∆vj(ψ(j) + 1) − yi∆vi(ψ(i)) ≥ 0. If this difference
is positive, it means that we have generated a faster partition which contradicts the choice of
ψ. Therefore we have obtained a new partition whose weighted speed is the same as ψ's. Notice that
the sum of absolute distances between this partition and p is smaller than the sum of absolute
distances between ψ and p. This contradicts the selection of ψ and thus concludes the proof
that p is of the fastest weighted speed with respect to y.
Our separation oracle builds the fastest cache partition by sequentially allocating the K
cache pages. It allocates each page by comparing the speed improvement gained by each of
the c alternatives, which involves arithmetic operations on the core speeds and the given query
point y ∈ Rc. Thus the separation oracle’s time complexity involves O(Kc) steps, each taking a
time polynomial in the bit-encoding length of the given query point and the bit-encoding length
of the core speeds. The violated constraint that is returned by this separation oracle is always
a speed vector of some cache partition, and thus its bit-encoding length is polynomial in c and
the encoding length of the core speeds.
Going back to Theorem 5.3, we see that we have a linear program in R^{c+1} with an upper
bound T on the encoding length of individual elements in the constraints and an optimization
direction of constant encoding length. Our separation oracle runs in time polynomial in K, c
and the speeds' encoding length, and returns violated constraints whose encoding length is
polynomial in c and the encoding length of the core speeds. Thus Theorem 5.3 provides us with
an algorithm that solves this dual linear program, and hence the variable cache partition problem, in
time polynomial in K, c and the speeds' encoding length.
Bibliography
[ADLO07] S. Angelopoulos, R. Dorrigiv, and A. Lopez-Ortiz. On the separation and equiva-
lence of paging strategies. In SODA, pages 229–237, 2007.
[AS09] S. Angelopoulos and P. Schweitzer. Paging and list update under bijective analysis.
In SODA, pages 1136–1145, 2009.
[Bel66] L. A. Belady. A study of replacement algorithms for a virtual-storage computer.
IBM Syst. J., 5(2):78–101, June 1966.
[BGV00] R. D. Barve, E. F. Grove, and J. S. Vitter. Application-controlled paging for a
shared cache. SIAM J. Comput., 29(4):1290–1303, February 2000.
[BW12] Vincenzo Bonifaci and Andreas Wiese. Scheduling unrelated machines of few dif-
ferent types. CoRR, abs/1205.0974, 2012.
[CH73] V. Chvatal and P. L. Hammer. Set-packing problems and threshold graphs. Tech-
nical Report CORR 73-21, Dep. of Combinatorics and Optimization, Waterloo,
Ontario, 1973.
[CPIM05] A. M. Campoy, I. Puaut, A. P. Ivars, and J. V. B. Mataix. Cache contents selection
for statically-locked instruction caches: An algorithm comparison. In Proceedings
of the 17th Euromicro Conference on Real-Time Systems, ECRTS ’05, pages 49–56,
Washington, DC, USA, 2005. IEEE Computer Society.
[Dre07] U. Drepper. What every programmer should know about memory, 2007.
http://people.redhat.com/drepper/cpumemory.pdf.
[EKS08] Tomas Ebenlendr, Marek Krcal, and Jiri Sgall. Graph balancing: a special case
of scheduling unrelated parallel machines. In Proceedings of the nineteenth an-
nual ACM-SIAM symposium on Discrete algorithms, SODA ’08, pages 483–490,
Philadelphia, PA, USA, 2008. Society for Industrial and Applied Mathematics.
[EL11] L. Epstein and A. Levin. Scheduling with processing set restrictions: PTAS results
for several variants. Int. J. Prod. Econ., 133(2):586 – 595, 2011.
[FPT07] H. Falk, S. Plazar, and H. Theiling. Compile-time decided instruction cache
locking using worst-case execution paths. In Proceedings of the 5th IEEE/ACM
international conference on Hardware/software codesign and system synthesis,
CODES+ISSS ’07, pages 143–148, New York, NY, USA, 2007. ACM.
[GLS81] M. Grotschel, L. Lovasz, and A. Schrijver. The ellipsoid method and its conse-
quences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.
[Gra69] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on
Applied Mathematics, 17:416–429, 1969.
[Has10] A. Hassidim. Cache replacement policies for multicore processors. In ICS, pages
501–509, 2010.
[HS88] D. S. Hochbaum and D. B. Shmoys. A polynomial approximation scheme for
scheduling on uniform processors: Using the dual approximation approach. SIAM
J. Comput., 17(3):539–551, 1988.
[HV08] Benjamin Hiller and Tjark Vredeveld. Probabilistic analysis of online bin coloring
algorithms via stochastic comparison. In ESA, pages 528–539, 2008.
[Ira96] S. Irani. Competitive analysis of paging: A survey. In In Proceedings of the
Dagstuhl Seminar on Online Algorithms, Dagstuhl, 1996.
[Kha79] L. G. Khachiyan. A polynomial algorithm in linear programming. Doklady
Akademii Nauk SSSR, 244:1093–1096, 1979.
[KMRS88] A. R. Karlin, M. S. Manasse, L. Rudolph, and D. D. Sleator. Competitive snoopy
caching. Algorithmica, 3:77–119, 1988.
[LLD+08] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights
into multicore cache partitioning: Bridging the gap between simulation and real
systems. In HPCA, pages 367–378, 2008.
[LLX12] T. Liu, M. Li, and C. J. Xue. Instruction cache locking for multi-task real-time
embedded systems. Real-Time Systems, 48(2):166–197, 2012.
[LOS12] A. Lopez-Ortiz and A. Salinger. Paging for multi-core shared caches. In ITCS,
pages 113–127. ACM, 2012.
[LST90] J. K. Lenstra, D. B. Shmoys, and Eva Tardos. Approximation algorithms for
scheduling unrelated parallel machines. Math. Program., 46:259–271, 1990.
[LX09] T. Liu, M. Li, and C. J. Xue. Minimizing WCET for real-time embedded
systems via static instruction cache locking. In Proceedings of the 2009 15th IEEE
Symposium on Real-Time and Embedded Technology and Applications, RTAS ’09,
pages 35–44, Washington, DC, USA, 2009. IEEE Computer Society.
[LZLX10] T. Liu, Y. Zhao, M. Li, and C. J. Xue. Task assignment with cache partitioning
and locking for WCET minimization on MPSoC. In ICPP, pages 573–582. IEEE
Computer Society, 2010.
[LZLX11] T. Liu, Y. Zhao, M. Li, and C. J. Xue. Joint task assignment and cache partition-
ing with cache locking for WCET minimization on MPSoC. J. Parallel Distrib.
Comput., 71(11):1473–1483, 2011.
[MCHvE06] A. M. Molnos, S. D. Cotofana, M. J. M. Heijligers, and J. T. J. van Eijndhoven.
Throughput optimization via cache partitioning for embedded multiprocessors. In
ICSAMOS, pages 185–192, 2006.
[MP95] N. V. R. Mahadev and U. N. Peled. Threshold graphs and related topics, volume 56
of Annals of Discrete Mathematics. Elsevier, 1995.
[Sch98] Alexander Schrijver. Theory of Linear and Integer Programming. Wiley, June
1998.
[ST85] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging
rules. Commun. ACM, 28(2):202–208, 1985.
[SV05] E. V. Shchepin and N. Vakhania. An optimal rounding gives a better approxi-
mation for scheduling unrelated machines. Oper. Res. Lett., 33(2):127–133, March
2005.
[VLX03] X. Vera, B. Lisper, and J. Xue. Data cache locking for higher program predictabil-
ity. In Proceedings of the 2003 ACM SIGMETRICS international conference on
Measurement and modeling of computer systems, SIGMETRICS ’03, pages 272–
282, New York, NY, USA, 2003. ACM.