
Exploring Computational Sprinting in New Domains

A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Indrajeet Saravanan, B.Tech.

Graduate Program in Computer Science and Engineering

The Ohio State University

2019

Master's Examination Committee:

Christopher Charles Stewart, Advisor

Radu Teodorescu


© Copyright by

Indrajeet Saravanan

2019


Abstract

Dark silicon and the utilization wall are the main issues that current processors face. Moore's law is effectively dead due to the breakdown of Dennard scaling. An array of novel approaches has been proposed to tackle these issues, and computational sprinting is the latest to be advocated. Computational sprinting is a set of management techniques that selectively speed up the execution of cores for short intervals of time, followed by idle periods, to achieve improved performance. This is physically feasible due to the chip's inherent thermal capacitance, which absorbs the heat generated by a rise in operating frequency or voltage. In this thesis, we explore multiple avenues for how and where computational sprinting can be used. First, we apply the core scaling method to sprint web queries, which makes page loads faster. We observe a 5.86% and 12.6% decrease in average load time for average-case and best-case scenarios, respectively. Likewise, the number of page loads increases by 12.12% (average-case) and 21.88% (best-case). Second, we explore the Intel Cache Allocation Technology tool to enable sprinting for server workloads. Since toggling L3 cache capacity for workloads introduces interference and uncertain consequences, we study the impact of cache stealing by co-located workloads. With the base and polluted cache states being 4 MB and 1 MB respectively, we observe an average increase of 41.37 seconds in the runtime of the Jacobi workload for every 20% increase in interruption from other workloads; we propose a machine learning approach as future work. Finally, we study SLOs in practice to analyze the realities and myths surrounding their design and use. We learn that single-digit response-time goals are challenging, that extreme percentiles for complex software are prohibited by black swans, and we identify the parameters of importance for evaluating infrastructure and complex cloud services.


Dedicated to my family.


Acknowledgments

I would like to start by thanking my advisor, Dr. Christopher Stewart, for his support, guidance, and the countless moments of encouragement throughout my time at Ohio State. I have been working with Dr. Christopher since the beginning of my first year and could not have hoped for a better mentor. He masterfully guided me through my research, providing direction when needed while consistently giving me the freedom to explore my own ideas. I would also like to thank Dr. Radu Teodorescu for being a part of my thesis committee. I was inspired to pursue my research and delve further into computer architecture after taking his class.

Furthermore, I want to thank the people who helped with the curation of this thesis: Nathaniel Morris, Jianru Ding, and Ruiqi Cao. Nathaniel implemented a large part of the workload profiler and queue simulator. Jianru and Ruiqi did a great deal of work on the study of SLOs.

And finally, I wouldn't be where I am if not for my family. I owe this to each and every one of them. They have always wanted the best for me, and I couldn't be more grateful for such a wonderful support system. I thank my parents, Shyamala and Saravanan, and my sister, Haripriya, for their love and encouragement throughout my college career.

Vita

2017: B.Tech. Computer Science and Engineering, National Institute of Technology, Trichy

2017-2019: Graduate Research Associate, The Ohio State University

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Computational Sprinting
       1.1.1 Dynamic Voltage Frequency Scaling
       1.1.2 Core Scaling

2. Computational Sprinting in Browsers
   2.1 Introduction
   2.2 Related Work
   2.3 Sprinting Parameters
   2.4 Design and Implementation
       2.4.1 Google Chrome Extension
       2.4.2 Zipf Distribution
       2.4.3 Algorithm
   2.5 Results
   2.6 Discussions

3. Computational Sprinting with Intel Cache Allocation Technology
   3.1 Introduction
   3.2 Design and Implementation
   3.3 Results

4. Study on Service-Level Objectives
   4.1 Introduction
   4.2 Analysis and Discussions

5. Conclusion and Future Work

List of Tables

2.1 Results for computational sprinting on web queries.

List of Figures

2.1 Dependency graph captured by traditional approaches (From [24]).
2.2 Overview of different Navigation Timings involved as per the W3C spec.
2.3 Workflow of computational sprinting for web queries.
2.4 Timeout for Sprinting vs Energy Consumed with budget of 500 J.
2.5 Timeout for Sprinting vs Energy Consumed with budget of 3000 J.
2.6 Timeout for Sprinting vs Energy Consumed with budget of 4000 J.
2.7 Timeout for Sprinting vs Energy Consumed with budget of 5000 J.
2.8 Timeout for Sprinting vs Energy Consumed with budget of 6000 J.
2.9 Timeout for Sprinting vs Energy Consumed with budget of 7000 J.
2.10 Timeout for Sprinting vs Energy Consumed with budget of 30,000 J.
2.11 Timeout for Sprinting vs Avg. Load Time with budget of 500 J.
2.12 Timeout for Sprinting vs Avg. Load Time with budget of 3000 J.
2.13 Timeout for Sprinting vs Avg. Load Time with budget of 4000 J.
2.14 Timeout for Sprinting vs Avg. Load Time with budget of 5000 J.
2.15 Timeout for Sprinting vs Avg. Load Time with budget of 6000 J.
2.16 Timeout for Sprinting vs Avg. Load Time with budget of 7000 J.
2.17 Timeout for Sprinting vs Avg. Load Time with budget of 30,000 J.
2.18 Timeout for Sprinting vs Pages Loaded with budget of 500 J.
2.19 Timeout for Sprinting vs Pages Loaded with budget of 3000 J.
2.20 Timeout for Sprinting vs Pages Loaded with budget of 4000 J.
2.21 Timeout for Sprinting vs Pages Loaded with budget of 5000 J.
2.22 Timeout for Sprinting vs Pages Loaded with budget of 6000 J.
2.23 Timeout for Sprinting vs Pages Loaded with budget of 7000 J.
2.24 Timeout for Sprinting vs Pages Loaded with budget of 30,000 J.
3.1 With Intel CAT, systems software manages cache lines.
3.2 Implementation of workload profiling in our approach.
3.3 Performance of Jacobi with 20 MB L3 cache.
3.4 Performance of K-means with 20 MB L3 cache.
3.5 Performance of BFS with 20 MB L3 cache.
3.6 Performance of Backprop with 20 MB L3 cache.
3.7 Performance of CFD with 20 MB L3 cache.
3.8 Percentage of Interruption vs Avg. Runtime for Jacobi with base cache state as 4 MB and polluted cache state as 1 MB.
3.9 Percentage of Interruption vs Avg. Runtime for Jacobi with base cache state as 4 MB and polluted cache state as 2 MB.
3.10 Percentage of Interruption vs Avg. Runtime for Jacobi with base cache state as 4 MB and polluted cache state as 3 MB.
4.1 Heat map of delay and percentile goals from industry sources.

Chapter 1: Introduction

1.1 Computational Sprinting

For the past 40 years, the chip industry followed Moore's law: the number of transistors on a processor doubled every 24 months. The main limitation that stunted this trend is the breakdown of Dennard scaling [22, 41, 8]. Dennard scaling states that as the size of transistors decreases, power density remains constant. In other words, for an increase in the transistor count by a factor of λ², the operating voltage decreases by a factor of λ and the clock frequency increases by a factor of λ. Because chips become unreliable at low voltages, the steady increase in clock frequency came to a halt [28]. The number of transistors that cannot be powered up due to the limit on Thermal Design Power (TDP) has become a growing issue. This phenomenon is called dark silicon [41, 23, 27]. Mike Muller, the CTO of ARM, announced that by the end of 2020, only 9% of the total transistors present in a multi-core processor could be powered up at one time [19].

On the other hand, current applications are highly interactive [48, 30, 6, 35]. They require intense computation for brief intervals followed by extended periods of idleness [27]. Examples of such applications include web browsers during page loading [37, 31, 10], web search [29, 44], Internet services [40, 38, 36, 39], edge-driven IoT workloads and analytics [2], and route planning.

Computational sprinting tackles the union of the above-mentioned issues [27]. It is a set of mechanisms that reduce response time at the expense of additional power consumption. By boosting frequency and voltage for momentary time spans [23, 13, 15], idle cores are activated, in turn increasing the performance of the chip. While sprinting, the chip generates heat faster than it can dissipate it, but the rise in temperature is gradual: the thermal capacitance of the chip absorbs the heat and prevents a sudden spike in the chip's temperature. Once the temperature hits the threshold value, sprinting is aborted by reducing the frequency or powering down cores. The base cores take care of any unfinished computation. This technique improves the responsiveness of applications.

Some of the most common means of achieving computational sprinting are Dynamic Voltage Frequency Scaling (voltage boosting and CPU throttling) and core scaling. We will look at them in detail in the following subsections. This thesis does not cover a widely used alternative, degrading the quality of answers momentarily [17, 1, 25, 2, 46]; these approaches are beyond its scope.

1.1.1 Dynamic Voltage Frequency Scaling

Dynamic Voltage Frequency Scaling (DVFS) is a mechanism by which the voltage or frequency of the processor can be increased to expend more power and gain performance, or decreased to conserve power at the cost of performance. Voltage boosting refers to modifying the supply voltage on the fly: a processor can be undervolted or overvolted. CPU throttling modifies the clock frequency, and hence the number of instructions processed per unit time, so it directly correlates with the performance of the machine: a processor can be over-clocked or under-clocked based on need.
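On Linux, DVFS can be driven from user space through the cpufreq sysfs interface, which we also rely on later to disable Turbo Boost. The sketch below is a minimal illustration, assuming the userspace governor is available on the machine and the script runs as root; the frequencies are examples, not recommendations.

import pathlib

def set_frequency(cpu: int, khz: int) -> None:
    # pin one core to a fixed clock via the cpufreq userspace governor
    base = pathlib.Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    (base / "scaling_governor").write_text("userspace")
    (base / "scaling_setspeed").write_text(str(khz))

set_frequency(0, 2_400_000)  # sprint: run core 0 at 2.4 GHz
# ... latency-critical burst executes here ...
set_frequency(0, 800_000)    # recover: drop back to 800 MHz to dissipate heat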

Intel's Turbo Boost technology achieves higher performance by running the active cores of a processor at a frequency greater than the base operating frequency [9]. The maximum frequency that Turbo Boost allows is constrained by the number of active cores, estimated power consumption, processor temperature, and estimated current consumption. PUPiL, a hybrid power capping system, focuses on timeliness and efficiency and beats RAPL (Running Average Power Limits), Intel's software approach, by 18% in mean performance [47]. Hsu et al. proposed Adrenaline to address the shortage of techniques for improving tail latency in online data-intensive (OLDI) services [15]. Adrenaline makes use of Short Stop, a circuit design that enables fine-grained voltage-frequency scaling [26].

1.1.2 Core Scaling

Core scaling is the technique of reducing latency by powering up more cores to work on a specific task; it is a means of deriving parallelism from the hardware.

In most cases, core scaling and DVFS are combined to provide an efficient means of sprinting. Haque et al. proposed Few-to-Many incremental parallelization, which utilizes hardware and software parallelism to address tail latency by dynamically applying resources based on the progress of each request [13].

The remainder of this thesis is organized as follows. In Chapter 2, we study the effects of using computational sprinting to speed up web queries. In Chapter 3, we employ Intel Cache Allocation Technology to manipulate the last-level cache as a technique for realizing computational sprinting. Chapter 4 studies Service Level Objectives as employed in practice. Finally, Chapter 5 draws conclusions.

Chapter 2: Computational Sprinting in Browsers

2.1 Introduction

Mobile users expect webpages to load instantly. A recent survey reveals that 2 seconds is the expected wait time for a webpage to load for more than 50% of internet users. Failure to meet such high standards could result in user dissatisfaction and, eventually, a huge loss in revenue for service providers. To put this in perspective, Amazon could suffer a loss of $1.6 billion in sales per year for a mere one-second delay in page load time [16]. The Google search engine previously used page load time as a key ranking factor for desktop searches. In July 2018, Google released the "Speed Update", which makes mobile page speed a core ranking factor in search results [43]. In spite of the significance of page load time, the collective effort of the research community has fallen short of user demands. There are various factors that can affect the page load time of webpages; substandard design, network issues, and server failures are a few common examples. In reality, most of these issues act in unison, so it is not trivial to precisely locate or understand the causes of slow page loads. Based on past studies, the following three factors account for the majority of the delay observed in load time [45].

1. Size of web pages: Unnecessary code is transmitted that will only be used depending on user activity.

2. Disparity in computation and data transfer: Due to inefficient scheduling and ordering in loading the objects of a webpage, data transfer and processing don't take place concurrently.

3. Complex dependencies in page resources: Convoluted dependencies between objects prolong the time taken to reach the final state of a webpage.

To tackle the above issues, we propose an approach that works on the user's device, assuming best-effort service at the other layers. We employ computational sprinting for specific web requests to abide by a predefined service level objective. Sprinting is driven by tracking parameters such as arrival rate, queuing delay, and energy budget, with an option to choose from multiple sprinting policies. We collect data by loading the 100 most visited pages from the Alexa Top 500 [42].

All of our results are collected on the Chrome browser, as it is considered the de facto standard among browsers and serves as the platform for 62.41% of internet traffic [34]. To keep track of the page load time of each request, we designed a Chrome extension that employs the Navigation Timing API to record page metrics. Each web request is associated with a process ID, and by using OS tools, isolated sprinting for individual pages is achieved. We find that by sprinting web queries, a 5.86% and 12.6% decrease in load time is observed for the average case and best case, respectively. Correspondingly, the energy consumption increases by 1.33% and 3.33% for the average case and best case, respectively.

2.2 Related Work

There have been numerous undertakings to understand the factors that affect page load time, implemented at the various layers present in the life cycle of a page load. Silo reduces the number of HTTP requests needed to load a page by using JavaScript and DOM Storage [20]; this approach modifies the structure and content of web pages to enable fine-grained caching. SPDY and QUIC are protocols deployed at the network level to improve page load latencies [4, 33]. These methods minimize network costs but do not address the substantial number of objects being fetched. At the computation level, solutions such as ZOOMM and Adrenaline focus on parallelizing the requests for objects in web pages [7, 18]. This approach fails to have a considerable impact because of the dependencies present in webpages.

Figure 2.1: Dependency graph captured by traditional approaches (From [24]).

A website can be structured in the form of a dependency graph, as shown in Figure 2.1. The graph reveals dependencies between modules as well as edges that could be processed in parallel. Browsers are provided with only a partial dependency graph, which leads to constrained loading of objects for fear of violating dependencies. Netravali et al. proposed Scout, a fine-grained dependency tracker that reveals finer critical paths, and Polaris, a dynamic client-side scheduler that takes advantage of Scout to reduce page load time [24].

2.3 Sprinting Parameters

To discuss sprinting policies, we first need the terminology of the input parameters; a short single-server simulation after this list shows how they interact.

1. Timeout: The decision of whether to sprint is made based on the timeout value. If the total execution time of a query crosses the timeout, sprinting is triggered.

2. Arrival Distribution: This parameter governs the rate at which queries come into the system. Pareto, exponential, or gamma distributions can be used.

3. Sprint Budget: The amount of resources available for sprinting. It can be expressed in seconds or joules.

4. Service Rate: The inverse of the mean processing time for query executions that are not sprinted. The most common unit in practice is queries per hour. In queueing theory, µ denotes the service rate.

5. Arrival Rate: The mean number of arrivals per unit time, generally expressed in queries per second and denoted λ. We report it as a percentage of the service rate. The inter-arrival time is the time between subsequent query arrivals.

6. Queuing Delay: The mean amount of time queries spend in a queue waiting for a slot to start execution. Computational sprinting creates an interdependency between queuing time and processing time.

7. Processing Time: The total amount of time spent processing a query with computational resources, including time spent both with and without sprinting.

8. Response Time: The sum of processing time and queuing time. Depending on the scenario, the variance of the response time can be as informative as its mean.
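The following sketch makes these parameters concrete with a single-server FIFO queue. The rates and sample size are illustrative, not the values used in our experiments.

import numpy as np

mu = 1 / 20.0                     # service rate: 1 / mean processing time
rho = 0.75                        # arrival rate as a fraction of the service rate
lam = rho * mu                    # arrival rate (queries per second)

rng = np.random.default_rng(0)
inter_arrivals = rng.exponential(1.0 / lam, size=100)
arrivals = np.cumsum(inter_arrivals)
services = rng.exponential(1.0 / mu, size=100)

free_at, response_times = 0.0, []
for t, svc in zip(arrivals, services):
    begin = max(t, free_at)             # queuing delay accrues while the server is busy
    free_at = begin + svc               # processing time occupies the server
    response_times.append(free_at - t)  # response time = queuing + processing

print(f"mean response time: {np.mean(response_times):.1f} s")

A sprinting policy would inspect each query's elapsed time against the timeout and, if the sprint budget is not exhausted, shorten the remaining processing time at some energy cost.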

2.4 Design and Implementation

In this section, we look at the modules that provide the ability to sprint page loads based on policies.

2.4.1 Google Chrome Extension

Google Chrome is the leading web browser in the market, ahead of alternatives such as Mozilla Firefox and Microsoft Edge. To make smart decisions on whether to sprint, keeping track of the load time is essential. Google Chrome allows us to add functionality to the browser by means of extensions, which can be downloaded from the Chrome Web Store or developed and deployed by the user.

The structure of our extension is as follows:

root
    background.js
    content.js
    icon.png
    manifest.json

Every Chrome extension must have a manifest.json file, which provides crucial details such as the version, permissions, and the URLs on which content scripts need to run. The content.js file is a content script, a JavaScript file that runs in the context of web pages; in other words, it can interact with the pages loaded in the browser. For the content script to interact with other components of the extension, it uses Chrome's message passing API. To listen for other actions happening in the browser, we require the background.js file, a background script that interacts with the content script. The icon image file is displayed in the extensions section of the browser.

Figure 2.2: Overview of different Navigation Timings involved as per the W3C spec.

The Navigation Timing API provides data that can be used to measure the performance of websites. It exposes several interfaces, such as Performance, PerformanceNavigationTiming, PerformanceNavigation, and PerformanceTiming. For our purposes, we use the PerformanceTiming interface to gather critical values like domainLookupStart and connectStart.

We use the timings in Figure 2.2 to compute important metrics such as total first-byte time, latency, DNS/domain lookup time, server connect time, server response time, page load time, transfer/page download time, DOM interactive time, DOM content load time, DOM processing-to-interactive, DOM interactive-to-complete, and onload.
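Given the raw PerformanceTiming attributes shipped by the extension, these metrics reduce to simple differences. The sketch below shows a few of them; the attribute names follow the W3C PerformanceTiming interface, while the millisecond values are made up for illustration.

# timestamps (ms) as reported by window.performance.timing, illustrative values
timing = {
    "navigationStart": 0, "domainLookupStart": 5, "domainLookupEnd": 35,
    "connectStart": 35, "connectEnd": 80, "responseStart": 250,
    "responseEnd": 400, "domInteractive": 900, "domComplete": 1900,
    "loadEventEnd": 2000,
}

metrics = {
    "dns_lookup":         timing["domainLookupEnd"] - timing["domainLookupStart"],
    "server_connect":     timing["connectEnd"] - timing["connectStart"],
    "time_to_first_byte": timing["responseStart"] - timing["navigationStart"],
    "page_download":      timing["responseEnd"] - timing["responseStart"],
    "dom_interactive":    timing["domInteractive"] - timing["navigationStart"],
    "page_load":          timing["loadEventEnd"] - timing["navigationStart"],
}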

We employed the XAMPP stack to collect all the above metrics locally [11]. XAMPP is a web server solution stack that contains the Apache HTTP Server, the MariaDB database, PHP, and Perl. XMLHttpRequest (XHR) is used to transfer data from the web browser to the SQL web server, which listens for GET requests. Once the page is done loading, the data is passed by adding it to URL variables and launching a GET request. All of this functionality is programmed in PHP.

2.4.2 Zipf Distribution

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table (a table that displays the frequency of various outcomes in a sample).

Formally, let:

• N be the number of elements;

• k be their rank;

• s be the value of the exponent characterizing the distribution.

Zipf's law then predicts that out of a population of N elements, the normalized frequency of the element of rank k, f(k; s, N), is:

f(k; s, N) = (1/k^s) / (Σ_{n=1}^{N} 1/n^s)    (2.1)

From [21], we learn that the trace of websites visited by an average user follows a Zipf distribution.
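A workload trace with this property is easy to generate. The sketch below samples page visits according to Equation 2.1; the page names are stand-ins for the Alexa Top 100 list used in our experiments, and the exponent s = 1 is illustrative.

import numpy as np

def zipf_weights(N: int, s: float = 1.0) -> np.ndarray:
    # f(k; s, N) = (1/k^s) / sum_{n=1..N} (1/n^s)
    ranks = np.arange(1, N + 1)
    w = 1.0 / ranks**s
    return w / w.sum()

rng = np.random.default_rng(0)
pages = [f"page-{k}" for k in range(1, 101)]   # rank-ordered page list
trace = rng.choice(pages, size=1000, p=zipf_weights(len(pages)))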

[Figure 2.3 (workflow diagram): queries drawn from the Alexa Top 100 according to a Zipf distribution are issued to a Chrome browser with the extension enabled; a logger records each query's PID, latency, server time, and timestamp while incrementing time and energy counters; a query moves from the base state to the sprint state if time > timeout, back to the base state once the budget reaches zero, and to a kill state if time > kill timeout.]

Figure 2.3: Workflow of computational sprinting for web queries.

2.4.3 Algorithm

As shown in Figure 2.3, each query is associated with a parent process ID (PID) as soon as its execution starts. Using this PID, we can keep track of all the child PIDs and the amount of time spent processing the query. If the current load time is greater than the specified timeout, we move the query to the sprint state. This is achieved using taskset, a Unix command capable of modifying the core affinity of processes. We use a global budget for sprinting: every time a query is processed in the sprint state, the energy consumed is deducted from the global budget. When the budget reaches zero, any queries in the sprint state are migrated back to the base state. We use the perf stat Unix command to monitor energy consumption. The following code snippet communicates the important methods of the program.

# Input parameters (values drawn from our experiment ranges)
webpages = "alexa_top_100.txt"   # file listing one URL per line
energy_budget = 5000.0           # global sprint budget (joules)
sprint_timeout = 8               # load time before a query is sprinted (s)
kill_timeout = 32                # load time before a query is terminated (s)
interarrival_scale = 4.4         # scale of the exponential inter-arrival times (s)
base_cores = "0"                 # core list for the base state
sprint_cores = "0-4"             # core list for the sprint state

import subprocess
import threading
import time

import numpy as np

time_logger, energy_logger, sprinted_pids = [], [], []
flag = 0  # set to 1 once every query has been issued

def chrome_pids():
    out = subprocess.run(["pgrep", "chrome"], capture_output=True, text=True).stdout
    return [int(p) for p in out.split()]

def elapsed(pid):
    # elapsed wall-clock seconds of a process, via ps
    out = subprocess.run(["ps", "-o", "etimes=", "-p", str(pid)],
                         capture_output=True, text=True).stdout.strip()
    return int(out) if out else 0

def sample_energy():
    # package energy over a one-second window (Intel RAPL, via perf stat)
    r = subprocess.run(["perf", "stat", "-e", "power/energy-pkg/", "-a", "sleep", "1"],
                       capture_output=True, text=True)
    for line in r.stderr.splitlines():
        if "Joules" in line:
            return float(line.split()[0].replace(",", ""))
    return 0.0

def logger():
    # sample time and energy once per second until the run ends
    if flag == 1:
        return
    threading.Timer(1.0, logger).start()
    time_logger.append(time.time())
    energy_logger.append(sample_energy())

def kill():
    # terminate queries whose load time exceeds the kill timeout
    if flag == 1:
        return
    threading.Timer(0.1, kill).start()
    for pid in chrome_pids():
        if elapsed(pid) > kill_timeout:
            subprocess.run(["kill", "-9", str(pid)])

def sprint():
    # migrate overdue queries to the sprint cores while budget remains
    if flag == 1:
        return
    threading.Timer(0.1, sprint).start()
    for pid in chrome_pids():
        if (elapsed(pid) > sprint_timeout
                and sum(energy_logger) < energy_budget
                and pid not in sprinted_pids):
            subprocess.run(["taskset", "-a", "-cp", sprint_cores, str(pid)])
            sprinted_pids.append(pid)

logger()
kill()
sprint()

inter_arrivals = np.random.exponential(interarrival_scale, size=100)
with open(webpages) as queries:
    for i, query in enumerate(queries):
        # launch each page load pinned to the base cores
        subprocess.Popen(["taskset", "-c", base_cores, "google-chrome", query.strip()])
        time.sleep(inter_arrivals[i])
flag = 1

2.5 Results

The following list specifies the range of values used in the experiments.

• Sprint Budget: 500, 3000, 4000, 5000, 6000, 7000, 30,000 (joules)

• Inter-arrival Time: 2.6, 4.4, 6.6, 7.9, 8.4 (seconds)

• Timeout for Sprinting: 0, 4, 8, 12, 16, 32 (seconds)

• No. of Pages Loaded: 100

• No. of Base Cores: 1

• No. of Sprint Cores: 4

These tests were run on an Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz processor with a total of 8 cores. Intel Turbo Boost was disabled by setting the governor to userspace, which allows the user to set the operating frequency of the processor. The following plots show how the energy consumed varies with the timeout for sprinting; results are grouped by the inter-arrival time between queries.

[Figures 2.4 through 2.10 are bar charts of energy consumed (joules) against the timeout for sprinting (0, 4, 8, 12, 16, and 32 seconds), with one series per inter-arrival time (2.6, 4.4, 6.6, 7.9, and 8.4 seconds).]

Figure 2.4: Timeout for Sprinting vs Energy Consumed with budget of 500 J.

Figure 2.5: Timeout for Sprinting vs Energy Consumed with budget of 3000 J.

Figure 2.6: Timeout for Sprinting vs Energy Consumed with budget of 4000 J.

Figure 2.7: Timeout for Sprinting vs Energy Consumed with budget of 5000 J.

Figure 2.8: Timeout for Sprinting vs Energy Consumed with budget of 6000 J.

Figure 2.9: Timeout for Sprinting vs Energy Consumed with budget of 7000 J.

Figure 2.10: Timeout for Sprinting vs Energy Consumed with budget of 30,000 J.

The first observation across all the above plots is the trend of increasing energy consumption, for a fixed energy budget and fixed timeout, as the inter-arrival time between queries increases. This makes sense: an increased inter-arrival time means the processor spends more time processing queries, and hence the energy consumption also increases.

We can also see how, for a fixed energy budget and fixed inter-arrival time, the energy consumption varies: as the timeout for sprinting increases, the amount of energy consumed decreases. There are two reasons for this observation.

1. Since more pages finish loading within the timeout, less energy is spent on sprinting page loads.

2. The amount of time spent in the sprint state shrinks as the timeout increases, so the energy consumption reduces accordingly.

For instance, for all inter-arrival times in Figure 2.10, we can see a steady decrease in the energy consumed as the timeout increases. In this specific case, an energy budget of 30,000 J effectively represents unlimited sprinting: with a timeout of zero, every query executes in the sprint state until completion, and as the timeout increases, less time is spent sprinting and the energy consumption falls.

[Figures 2.11 through 2.17 are bar charts of average load time (seconds) against the timeout for sprinting (0, 4, 8, 12, 16, and 32 seconds), with one series per inter-arrival time (2.6, 4.4, 6.6, 7.9, and 8.4 seconds).]

Figure 2.11: Timeout for Sprinting vs Avg. Load Time with budget of 500 J.

Figure 2.12: Timeout for Sprinting vs Avg. Load Time with budget of 3000 J.

Figure 2.13: Timeout for Sprinting vs Avg. Load Time with budget of 4000 J.

Figure 2.14: Timeout for Sprinting vs Avg. Load Time with budget of 5000 J.

Figure 2.15: Timeout for Sprinting vs Avg. Load Time with budget of 6000 J.

Figure 2.16: Timeout for Sprinting vs Avg. Load Time with budget of 7000 J.

Figure 2.17: Timeout for Sprinting vs Avg. Load Time with budget of 30,000 J.

The above plots depict the relation between average load time and the timeout for sprinting under varying energy budgets. The main observation concerns a fixed inter-arrival time and energy budget. In Figure 2.17, the average load time with a sprint timeout of 0 is 12.49 seconds, and as the timeout increases, the average load time increases. However, anomalies are sometimes observed; for instance, at a sprint timeout of 12 seconds we see a spike in average load time.

[Figures 2.18 through 2.24 are bar charts of the number of pages loaded against the timeout for sprinting (0, 4, 8, 12, 16, and 32 seconds), with one series per inter-arrival time (2.6, 4.4, 6.6, 7.9, and 8.4 seconds).]

Figure 2.18: Timeout for Sprinting vs Pages Loaded with budget of 500 J.

Figure 2.19: Timeout for Sprinting vs Pages Loaded with budget of 3000 J.

Figure 2.20: Timeout for Sprinting vs Pages Loaded with budget of 4000 J.

Figure 2.21: Timeout for Sprinting vs Pages Loaded with budget of 5000 J.

Figure 2.22: Timeout for Sprinting vs Pages Loaded with budget of 6000 J.

Figure 2.23: Timeout for Sprinting vs Pages Loaded with budget of 7000 J.

Figure 2.24: Timeout for Sprinting vs Pages Loaded with budget of 30,000 J.

The above charts plot the number of pages that loaded successfully against the timeout for sprinting under varying energy budgets. As the inter-arrival time increases, the number of successful page loads also increases, because fewer stale queries are terminated.

An important observation in Figure 2.20 is that, with an inter-arrival time of 2.6 seconds, about 30 pages are loaded on average with sprinting at the different timeouts. However, with a timeout of 32 seconds, the case where no sprinting occurs, only 19 pages are loaded. This trend can be observed in other figures as well, such as Figures 2.21, 2.22, and 2.24. We conclude that sprinting does indeed help load extra web pages that would otherwise not load under the given conditions.

2.6 Discussions

We divide our sprinting policy into three categories: 1) no sprinting, 2) average sprinting, and 3) always sprinting. They correspond to the worst-case, average-case, and best-case scenarios, respectively. Table 2.1 below shows the improvement from employing computational sprinting for webpage loads. The average inter-arrival time is 5.98 seconds.

Category       Energy Consumed   Avg. Load Time   Avg. Page Loads
Worst-Case     4507.35 J         21.21 s          51.82
Average-Case   4567.24 J         19.97 s          58.11
Best-Case      4657.24 J         18.58 s          63.17

Table 2.1: Results for computational sprinting on web queries.

By using computational sprinting, we decrease the average load time by 5.86% (average case) and 12.6% (best case). The number of page loads increases by 12.12% (average case) and 21.88% (best case). Accordingly, the energy consumption increases by 1.33% (average case) and 3.33% (best case).

While we can see trends in our results, the number of anomalies is also high. This implies that the technique does speed up web queries, but we cannot rely on it completely. One of the main reasons could be the delay in moving a specific query from the base state to the sprint state: the delay and the improvement gained from sprinting cancel each other out, leaving no net improvement.

In the future, we could group a batch of queries and sprint them all together, which would reduce the overhead of changing states; the downside would be a more rapid depletion of the sprint budget. All experiments were carried out in the Chrome browser on laptops. Since mobile browsers have surpassed desktop browsers in internet usage, manipulating the cores of mobile devices is a natural next step.

Chapter 3: Computational Sprinting with Intel Cache Allocation Technology

3.1 Introduction

In 2016, Intel introduced Cache Allocation Technology (CAT) in the Xeon 2600 v4 architecture [14]. Figure 3.1 compares the traditional caching approach to CAT. CAT functions by storing tags for both virtual memory and the process ID; cache hits must match the process ID. CAT lets workloads share the last-level cache while exclusively reserving cache lines, so workloads designed to execute on shared servers can now have guaranteed SRAM cache. Integrated SRAM supplies data 12X faster than DRAM, so CAT can substantially speed up workloads that exhibit cache locality.

SRAM is created using only electronic switches and transistors; it is the only memory that can be synchronously integrated with processor clocks, and it represents the majority of transistors on modern processors. Since 2012, SRAM storage on Xeon processors has grown by 40%. L3 caches were present in only 25% of Xeon processors in 2012 but are now present in 40%. When manufacturing processes support 5 nm transistors, SRAM storage on the chip will increase further. The Xeon 2800 has a cache size of 60 MB; this server recently broke the world record for TPC-H QphH@Size (Composite Queries per Hour performance metric). TPC-H is a decision support benchmark that is widely adopted in industry [5].

[Figure 3.1 (diagram): a load instruction's memory address is combined with the current CAT group register; CAT offsets and lengths constrain the index translation into the set-associative cache, so a hit must fall within the cache lines reserved for the workload's group.]

Figure 3.1: With Intel CAT, systems software manages cache lines.

Few workloads need exclusive access to large amounts of cache. For example, small workloads such as 1 GB TPC-H databases perform best with 10 MB of cache; it is worth noting that 80% of databases worldwide are smaller than 1 GB. However, competing workloads cause jitter and can degrade throughput by 15%. One solution is to over-provision the cache, giving each workload access to more cache than it needs, but over-provisioning is (1) hard to control and (2) wasteful: on a Xeon 2800, over-provisioning a database that uses 10 MB by 25% degrades parallelism by 25%. CAT reduces jitter, so workloads sensitive to cache interference can now run on shared servers without over-provisioning, significantly reducing the cost of executing these workloads in the cloud.

The main problems with state-of-the-art approaches are listed below.

• Over-provisioning is too costly.

• Cache interference degrades performance.

• Cloud resources execute these workloads inefficiently.

• Economies of scale are lost as the database grows.

Multi-core machines run many processes that share on-chip cache lines (SRAM), and CAT gives systems software the means to manage them. Guest workloads can now exclusively reserve cache lines and exploit integrated SRAM (100X faster than main memory) to run efficiently on a shared CPU. However, SRAM has low memory density and should be reserved parsimoniously. Static management policies over-provision the SRAM, fragment the cache between workloads, trigger unneeded scale-out, and degrade whole-system performance. In contrast, computational sprinting is an innately parsimonious approach: a workload is initially allocated few resources, but if a service level objective (SLO) violation becomes likely, additional resources are allocated and the workload is sped up, i.e., it sprints. Sprinting with CPU resources, e.g., voltage and core scaling, can reduce SLO violations by 95%. Cache allocation technology supports a new sprinting mechanism: at-risk workloads can temporarily access more cache lines. The tool can modify cache allocations within 400-700 milliseconds, which is adequate for workloads that take 2-3 seconds to complete. The downside, however, is that cache interference affects speedup and predictability. Hence, we study the feasibility of utilizing cache allocation technology for scheduling workloads on servers.

3.2 Design and Implementation

Intel's CAT tool enables control over the last-level cache by introducing Classes of Service (CLOS), each of which represents a group of processes, threads, or applications. A process can be added to or removed from a CLOS. Every CLOS has a capacity bitmask (CBM) that indicates the amount of cache available for its use. There are two constraints in creating a CLOS: the allocated ways must be contiguous, and each CLOS should have at least two ways. These ways are mapped logically to processors. Suppose a process is assigned to CLOS A and, due to subdued performance, is moved to CLOS B, which has more cache lines assigned to it. This does not imply that the process's cache hits come only from CLOS B; they can be serviced by the cache lines of CLOS A as well.
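As a concrete illustration, the sketch below drives CAT through Intel's pqos utility (from the intel-cmt-cat package) to emulate a cache sprint by moving a process between a small and a large CLOS. The CLOS numbers, bitmasks, and PID are illustrative, and the OS interface (-I) is assumed to be available.

import subprocess

PID = 1234  # hypothetical workload process

def pqos(args: str) -> None:
    subprocess.run(f"pqos {args}", shell=True, check=True)

# CLOS 1: two contiguous ways (base state); CLOS 2: eight contiguous ways (sprint state)
pqos('-e "llc:1=0x003;llc:2=0x0ff"')

# start the workload in the base state
pqos(f'-I -a "pid:1={PID}"')

# ... if an SLO violation becomes likely, sprint by re-associating the PID ...
pqos(f'-I -a "pid:2={PID}"')

# ... once the risk passes (or the budget is spent), return to the base state
pqos(f'-I -a "pid:1={PID}"')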

[Figure 3.2 (diagram): a query generator, driven by workload conditions (arrival rate, workload, timeout, % of interruption, base cache, arrival distribution), feeds tasks to a queue manager with sprint policies; the execution engine runs each task in a base state or a polluted state and reports the response time.]

Figure 3.2: Implementation of workload profiling in our approach.

We use Rodinia, a benchmark suite for heterogeneous computing, for our experiments [32]. As Figure 3.2 shows, the query generator uses the arrival rate, workload type, and arrival distribution to produce HTTP query requests. All requests enter a FIFO queue; the query at the head of the queue is popped and starts execution once a slot opens in the execution engine. Every query carries a timestamp, which helps in calculating the time spent in the queue and in execution. Every request is assigned a base cache capacity, and the cache capacity of each query is toggled between a base state and a polluted state according to a Poisson distribution. The queue manager makes decisions based on the timeout in our case. The response time and other cache usage metrics are recorded locally using the perf stat Unix tool.
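The toggling itself can be emulated by drawing pollution episodes from a Poisson process, i.e., with exponentially distributed gaps between state changes. A minimal sketch follows; the rate and window are illustrative, not our experimental values.

import numpy as np

rng = np.random.default_rng(1)
mean_gap = 30.0                 # mean seconds between state toggles
window = 300.0                  # length of the execution window (s)

t, toggles = 0.0, []
while True:
    t += rng.exponential(mean_gap)   # Poisson process: exponential inter-event gaps
    if t >= window:
        break
    toggles.append(t)

# at each toggle time the queue manager would switch the query's CLOS
# between the base state (e.g., 4 MB) and the polluted state (e.g., 1 MB)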

3.3 Results

The following experiments profile the impact of the L3 cache on each workload's performance. We use the results to decide how many cache lines are ideal for the base mode and the sprint mode.

The processor used for the experiments is an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz with a total of 16 cores and a shared L3 cache of 20 MB.

We can observe from Figure 3.3 that there is a stark decrease in the execution time of the workload as the number of cache ways increases. There is a 71.95% decrease in runtime between using 1 MB and 2 MB of L3 cache, so Jacobi performs far better when assigned 2 MB rather than 1 MB. The best response time is achieved at 4 MB of cache, a 75.3% decrease, and over-provisioning has no further effect; in fact, it causes extra latency in some cases. We can also see that the percentage of last-level cache load misses among all cache hits falls as the size of the assigned cache increases.

[Figures 3.3 through 3.7 are bar charts of execution time (seconds) and the percentage of LLC load misses among all LLC hits, plotted against the number of cache ways (1, 2, 3, 4, 8, 12, 16, 20).]

Figure 3.3: Performance of Jacobi with 20 MB L3 cache.

Figure 3.4: Performance of K-means with 20 MB L3 cache.

Figure 3.5: Performance of BFS with 20 MB L3 cache.

Figure 3.6: Performance of Backprop with 20 MB L3 cache.

Figure 3.7: Performance of CFD with 20 MB L3 cache.

While we can see the same trend in Figures 3.5, 3.6, and 3.7, the magnitude of the decrease in execution time is minimal. On the other hand, the execution time stays the same in Figure 3.4; we conclude that K-means' performance does not depend on the L3 cache.

For the following experiments, we choose Jacobi as our sole workload. We would like to learn the level of cache interference that each workload can handle. In reality, we would have two competing, co-located workloads A and B, each assigned a base cache capacity. As A's performance degrades, it will try to access the cache lines associated with B to maintain its SLO. This may or may not influence the behavior of workload B; the outcome depends largely on the types of workloads. A positive scenario is a cache-intensive workload co-located with an I/O-intensive workload: constant interference from the cache-intensive workload would not deter the other workload's performance. However, if both workloads rely heavily on the L3 cache, it would be unwise to co-locate them. In practice, we cannot profile every workload and place it on servers accordingly.

To study the effects of the above-mentioned scenario, we use a base state with 4 MB of L3 cache and a polluted state with 1 MB, emulating incessant interference from competing workloads.

• Timeout: 100 seconds

• Arrival Rate: 0.75, 0.95 of the service rate

• Service Rate: 0.0238

• Base Cache State: 4 MB

• Polluted Cache State: 1, 2, 3 (MB)

38

Page 52: Exploring Computational Sprinting in New Domains

73.075

101.728

130.626

171.955

266.485

283.832

90.869

140.203

153.964

210.098

279.377293.984

0

50

100

150

200

250

300

350

0 20 40 60 80 100

avg.

run

time

% of interruption during execution

Percentage of Interruption vs Avg. Run Time0.75 0.95

Figure 3.8: Percentage of Interruption vs Avg. runtime for Jacobi with base cachestate as 4 MB and polluted cache state as 1 MB.

73.333 73.295 72.846 73.48576.565

77.912

90.551 90.796 90.483 90.539 90.935 90.799

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

avg.

run

time

% of interruption during execution

Percentage of Interruption vs Avg. Run Time0.75 0.95

Figure 3.9: Percentage of Interruption vs Avg. runtime for Jacobi with base cachestate as 4 MB and polluted cache state as 2 MB.

39

Page 53: Exploring Computational Sprinting in New Domains

73.479 73.496 73.459 73.431 73.455 72.863

90.417 90.696 90.35 90.262 90.383 90.463

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

avg.

run

time

% of interruption during execution

Percentage of Interruption vs Avg. Run Time0.75 0.95

Figure 3.10: Percentage of interruption vs. average runtime for Jacobi with base cache state 4 MB and polluted cache state 3 MB.

From Figure 3.8, we can see that the average runtime of the workload grows as the percentage of interruption during execution increases. This trend is visible when the base cache state is 4 MB and the polluted cache state is 1 MB. When the polluted cache state is 2 MB or 3 MB, however, the increase in average runtime is minimal. The average runtime also rises with the arrival rate. With a base cache state of 4 MB and a polluted cache state of 1 MB, we observe an average increase of 41.37 seconds in the runtime of the Jacobi workload for every 20% increase in interruption from other workloads. We can conclude that offline profiling of a specific workload reveals the effect of the L3 cache on its performance; as mentioned before, though, such profiling is not feasible in practice. For future work, we propose a machine learning approach to accurately map new workloads to cache ways, inspired by the positive results of Morris et al. [22].
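As a sanity check, the per-step increase can be recomputed from the average runtimes plotted in Figure 3.8. The sketch below uses values transcribed from the figure and simple endpoint averaging, which recovers roughly the reported number; the exact averaging used for the 41.37-second figure may differ slightly.

```python
# Average runtimes (sec) at 0, 20, 40, 60, 80, 100 percent interruption,
# transcribed from Figure 3.8 for the two arrival rates.
runtimes = {
    0.75: [73.075, 101.728, 130.626, 171.955, 266.485, 283.832],
    0.95: [90.869, 140.203, 153.964, 210.098, 279.377, 293.984],
}

# Average rise per 20% step: total rise divided by the number of steps.
per_step = [(s[-1] - s[0]) / (len(s) - 1) for s in runtimes.values()]
print(sum(per_step) / len(per_step))  # ~41.4 sec per 20% step
```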


To implement the machine learning model, we collected data from the following setup. Each server-worker unit can be launched as a separate Docker instance, and the last-level cache is shared among all Docker instances. We vary the arrival rate, timeout, cores, cache, and budget. The bulk data from these experiments is fed into a random decision forest for the training phase; because random decision forests build deep trees, the bias of the model is greatly reduced. A sketch of this training pipeline appears below.
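The following is a minimal sketch of that pipeline; the file name sprint_profiles.csv, its column layout, and the choice of target are assumptions for illustration, not the exact pipeline we built.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical layout: arrival_rate, timeout, cores, cache_mb, budget,
# followed by the cache-way allocation that met the SLO at lowest cost.
data = np.loadtxt("sprint_profiles.csv", delimiter=",", skiprows=1)
X, y = data[:, :5], data[:, 5]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Deep, unpruned trees keep each tree's bias low; averaging across the
# forest then controls the variance.
model = RandomForestRegressor(n_estimators=200, max_depth=None, random_state=42)
model.fit(X_train, y_train)

print("R^2 on held-out data:", model.score(X_test, y_test))
```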


Chapter 4: Study on Service-Level Objectives

4.1 Introduction

Service Level Objectives (SLOs) are the fundamental constituents of Service Level Agreements (SLAs). An SLA is an agreement between a customer and a service provider, established in advance to avoid confusion or disputes between the parties. Quality, availability, and responsibility are some of the main aspects discussed quantitatively in SLOs. Common metrics in practice include Abandonment Rate, Average Speed to Answer (ASA), Time Service Factor (TSF), First-Call Resolution (FCR), Turn-Around Time (TAT), and Mean Time To Recover (MTTR). Failing to meet SLOs results in financial loss for the provider. For example, if the monthly uptime percentage for a Google Cloud customer drops below 95% in a month, Google must refund 50% of the monthly bill for the respective service [12].

SLOs are integral to cloud services. They are used in networking infrastructure, microservices, data centers, and cloud applications. As the adoption of SLOs grows, system managers must set performance goals for them. This motivates our study of common practices and misconceptions in the design of SLOs.


4.2 Analysis and Discussions

We collected 9,634 documents that contained SLOs. Applying a Systematic Literature Review, we arrived at the following analysis. After settling on features of importance such as response time, delay, and reporting period, we trimmed the dataset by eliminating unsuitable SLO documents; removing outliers further cleaned it. In the end, we had 75 SLOs: 34 from academic sources and 41 from industry sources.

[Heat map: SLO percentile (90.0%, 95.0%, 98.0%, 99.0%, 99.5%, 99.6%, 99.7%, 99.8%) vs. SLO delay (<10 ms, 10–100 ms, 100–1000 ms, >1000 ms); cell shading indicates the number of matching SLOs, from none to five.]

Figure 4.1: Heat map of delay and percentile goals from industry sources.

Single-digit response goals are challenging. Figure 4.1 shows only one matching SLO with a delay below 10 ms, which indicates that achieving single-digit response goals is a challenging task. The growing availability of hardware devices that operate in the microsecond range has instigated the "era of the killer microsecond" [3]. Redesigning the system stack and making traditional low-level system optimizations microsecond-aware will help accomplish single-digit response goals.

Black swans prohibit extreme percentiles for complex software. The lower half of Figure 4.1 shows that few SLOs are reported in the 99.5%–99.8% percentile range. Extreme SLO percentiles are not viable for complex software systems because even rare black-swan events can violate them. Note that this observation holds only for software systems that demand heavy computation and interaction with hardware devices. Since black-swan events are correlated, extreme SLO percentiles become even harder to sustain.

Evaluation of complex cloud services should vary percentiles. Across the SLOs collected from industry and academia, the SLO percentile parameter is scattered from 90.0% to 99.9% for both source types. In contrast, the SLO percentiles for networking services are concentrated around 99.0% and 99.9%. We infer that evaluating complex cloud services calls for varying the percentile.

Evaluation of infrastructure should vary the delay. The delay parameter is distributed across the whole range for the SLOs of networking services in academia and industry. Contrarily, the delay parameter clusters around 128 ms for application services. Hence, when devising SLOs for networking services, varying the delay is essential.

We need evaluation tools for long reporting periods. The SLO timeframe parameter specifies the time between two consecutive checks for SLO violations. The timeframe is set based on the complexity and type of service offered. Complex services have reporting periods measured in days or months; for microservices, reporting periods are on the order of seconds. However, fewer than 15% of SLOs have reporting periods in the 5–10 minute range. A minimal sketch of such a violation check appears below.
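The sketch below illustrates what such a tool must do at its core: evaluate a percentile/delay goal once per reporting window. The function name and default parameters are illustrative, mirroring the three SLO features discussed above.

```python
import numpy as np

def slo_violations(latencies_ms, timestamps_s, percentile=99.0,
                   delay_ms=100.0, timeframe_s=600.0):
    """Return (window_start, tail_latency) for every reporting window in
    which the tail latency exceeded the delay goal."""
    latencies_ms = np.asarray(latencies_ms, dtype=float)
    timestamps_s = np.asarray(timestamps_s, dtype=float)
    violations = []
    window = timestamps_s.min()
    end = timestamps_s.max()
    while window <= end:
        # Select the requests that arrived inside this reporting window.
        mask = (timestamps_s >= window) & (timestamps_s < window + timeframe_s)
        if mask.any():
            tail = np.percentile(latencies_ms[mask], percentile)
            if tail > delay_ms:
                violations.append((window, tail))
        window += timeframe_s
    return violations
```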


Chapter 5: Conclusion and Future Work

Our work has broadly connected various aspects of computational sprinting. First, we developed a sprinting mechanism to speed up web queries, implemented as a Chrome extension and evaluated on the Alexa Top 500 webpages. Second, we addressed the industry problem of over-provisioning workloads by devising a system that employs the Intel Cache Allocation Technology tool to dynamically allocate L3 cache; a workload profiler and a queue simulator realized this design. Lastly, we carried out a systematic literature review of SLOs in industry and academia, drawing multiple conclusions that will help system managers set SLOs in the future.


Bibliography

[1] Ankur Agrawal et al. "Approximate computing: Challenges and opportunities". In: ICAC. 2016.

[2] Md Tanvir Al Amin et al. "Social Trove: A Self-Summarizing Storage Service for Social Sensing". In: ICAC. 2015.

[3] Luiz André Barroso et al. "Attack of the killer microseconds". In: Communications of the ACM (2017).

[4] Mike Belshe and Roberto Peon. "SPDY protocol". 2012.

[5] Peter Boncz, Thomas Neumann, and Orri Erling. "TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark". In: Technology Conference on Performance Evaluation and Benchmarking. Springer, 2013, pp. 61–76.

[6] Marco Brocanelli and Xiaorui Wang. "Smartphone Radio Interface Management for Longer Battery Lifetime". In: ICAC. 2017.

[7] Calin Cascaval et al. "ZOOMM: A parallel web browser engine for multicore mobile devices". In: ACM SIGPLAN Notices. Vol. 48. 8. ACM, 2013, pp. 271–280.

[8] Xin Chen et al. "GOVERNOR: Smoother Stream Processing Through Smarter Backpressure". In: ICAC. 2017.

[9] Intel Corporation. Intel Turbo Boost Technology in Intel Core microarchitecture (Nehalem) based processors. 2008.

[10] Nan Deng et al. "Adaptive Green Hosting". In: IEEE International Conference on Autonomic Computing. 2012.

[11] Dalibor D. Dvorski. "Installing, configuring, and developing with XAMPP". In: Skills Canada (2007).

[12] Google Compute Engine Service Level Agreement (SLA). 2018. URL: https://cloud.google.com/compute/sla.

[13] Md E. Haque et al. "Few-to-many: Incremental parallelism for reducing tail latency in interactive services". In: ACM SIGARCH Computer Architecture News 43.1 (2015), pp. 161–175.


[14] Andrew Herdrich et al. "Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family". In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2016, pp. 657–668.

[15] Chang-Hong Hsu et al. "Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting". In: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2015, pp. 271–282.

[16] K. Eaton. How one second could cost Amazon $1.6 billion in sales. 2013. URL: https://www.fastcompany.com/1825005/how-one-second-could-cost-amazon-16-billion-sales.

[17] Jaimie Kelley et al. "Measuring and Managing Answer Quality for Online Data-Intensive Services". In: ICAC. 2015.

[18] Haohui Mai et al. "A case for parallelizing web pages". In: 4th USENIX Workshop on Hot Topics in Parallelism. 2012.

[19] Rick Merritt. "ARM CTO: power surge could create 'dark silicon'". In: EE Times (Oct. 2009).

[20] James Mickens. "Silo: Exploiting JavaScript and DOM Storage for Faster Page Loads". In: WebApps. 2010.

[21] Alan L. Montgomery and Christos Faloutsos. "Identifying web browsing trends and patterns". In: Computer 34.7 (2001), pp. 94–95.

[22] Nathaniel Morris et al. "Model-driven computational sprinting". In: Proceedings of the Thirteenth EuroSys Conference. ACM, 2018, p. 38.

[23] Nathaniel Morris et al. "Sprint ability: How well does your software exploit bursts in processing capacity?" In: 2016 IEEE International Conference on Autonomic Computing (ICAC). IEEE, 2016, pp. 173–178.

[24] Ravi Netravali et al. "Polaris: Faster page loads using fine-grained dependency tracking". In: 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). 2016.

[25] Ajay Panyala et al. "Approximate Computing Techniques for Iterative Graph Algorithms". In: ICAC. 2017.

[26] Nathaniel Pinckney et al. "Shortstop: An on-chip fast supply boosting technique". In: 2013 Symposium on VLSI Circuits. IEEE, 2013, pp. C290–C291.

[27] Arun Raghavan et al. "Computational sprinting". In: IEEE International Symposium on High-Performance Computer Architecture. IEEE, 2012, pp. 1–12.

[28] Daniel A. Reed and Jack Dongarra. "Exascale Computing and Big Data". In: Communications of the ACM 58.7 (June 2015), pp. 56–68. ISSN: 0001-0782. DOI: 10.1145/2699414. URL: http://doi.acm.org/10.1145/2699414.


[29] S. Ren and M. A. Islam. "Colocation demand response: Why do I turn off my servers?" In: ICAC. 2014.

[30] S. Ren et al. "Exploiting processor heterogeneity in interactive services". In: ICAC. 2013.

[31] Siva Renganathan et al. "Preliminary Results on an Interactive Learning Tool for Early Algebra Education". In: IEEE Frontiers in Education. 2017.

[32] Rodinia: A Benchmark Suite for Heterogeneous Computing. 2018. URL: http://lava.cs.virginia.edu/Rodinia/download_links.htm.

[33] Jim Roskind. "QUIC: Multiplexed stream transport over UDP". In: Google working design document (2013).

[34] StatCounter Global Stats: Browser, OS, Search Engine. 2018. URL: http://gs.statcounter.com/.

[35] Christopher Stewart, Aniket Chakrabarti, and Rean Griffith. "Zoolander: Efficiently Meeting Very Strict, Low-Latency SLOs". In: Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13). 2013, pp. 265–277.

[36] Christopher Stewart, Terence Kelly, and Alex Zhang. "Exploiting Nonstationarity for Performance Prediction". In: European Conference on Computer Systems. 2007.

[37] Christopher Stewart, Matthew Leventi, and Kai Shen. "Empirical Examination of a Collaborative Web Application". In: IEEE International Symposium on Workload Characterization. 2008.

[38] Christopher Stewart and Kai Shen. "Performance Modeling and System Management for Multi-component Online Services". In: Symposium on Networked Systems Design and Implementation. 2005.

[39] Christopher Stewart et al. "A Dollar From 15 Cents: Cross-Platform Management for Internet Services". In: USENIX Annual Technical Conference. 2008.

[40] Christopher Stewart et al. "Profile-driven Component Placement for Cluster-based Online Services". In: International Middleware Conference. 2004.

[41] Michael B. Taylor. "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse". In: DAC Design Automation Conference 2012. IEEE, 2012, pp. 1131–1136.

[42] Top Sites in United States - Alexa. 2019. URL: https://www.alexa.com/topsites/countries/US.

[43] Using Page Speed in mobile search ranking. 2018. URL: https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html.

[44] Cheng Wang et al. "Effective Capacity Modulation as an Explicit Control Knob for Public Cloud Profitability". In: ICAC. 2016.


[45] Xiao Sophia Wang, Arvind Krishnamurthy, and David Wetherall. "Speeding up web page loads with Shandian". In: 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). 2016, pp. 109–122.

[46] Zichen Xu et al. "CADRE: Carbon-Aware Data Replication for Geo-Diverse Services". In: ICAC. 2015.

[47] Huazhe Zhang and Henry Hoffmann. "Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques". In: ACM SIGARCH Computer Architecture News 44.2 (2016), pp. 545–559.

[48] Kuangyu Zheng, Bruce Beitman, and Xiaorui Wang. "CoSmart: Coordinating Smartphone with Desktop Computer for Joint Energy Savings". In: ICAC. 2017.
