Intelligent Data Caching in the LTE Telecom Network
E L I N A M E I E R
Master of Science Thesis Stockholm, Sweden 2012
DD221X, Master's Thesis in Computer Science (30 ECTS credits)
Degree Programme in Computer Science and Engineering (300 credits)
Royal Institute of Technology, year 2012
Supervisor at CSC was Alexander Baltatzis
Examiner was Olle Bälter
TRITA-CSC-E 2012:077
ISRN-KTH/CSC/E--12/077--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
Intelligent Data Caching in the LTE Telecom Network
Abstract
This degree project, conducted at Ericsson AB, investigates whether an intelligent caching algorithm can achieve a better cache hit ratio than a general replacement algorithm in the Long Term Evolution (LTE) telecom network environment. The intelligence refers to the use of network statistics available in the operating system. The project includes research on memory management systems, memory access patterns and caching algorithms. A new algorithm, Three-Queue-Metric (3QM), is designed to conform to the characteristics of the access patterns demonstrated by different traffic scenarios. It is empirically shown that 3QM performs better than all other algorithms included in the study on a workload specific to the LTE network. It is therefore concluded that network statistics can indeed be part of a process to improve cache performance.
En statistik-baserad cachingalgoritm för LTE-basstationer

Summary (Sammanfattning)

This degree project, carried out at Ericsson AB, investigates whether an intelligent caching algorithm can improve cache performance in a Long Term Evolution (LTE) network compared to a general replacement algorithm. The intelligence refers to the use of network statistics available in the operating system of a base station. The work also includes a survey of memory management systems, memory access patterns and replacement algorithms. The study leads to the design of a new algorithm, called Three-Queue-Metric (3QM), intended to accommodate all access patterns exhibited by the different traffic scenarios and evaluated algorithms. It is established empirically that 3QM performs better than any of the other implemented algorithms on the specific input data corresponding to an LTE network. It is therefore concluded that network statistics can be used to improve cache performance.
Preface
This degree project is the final task of my Master's in Computer Science, and I would like to thank everyone who has helped and supported me throughout the project. I have had a wonderful time carrying out this intriguing task.

A special thanks to:

Kenneth Hilmersson, Per-Olof Gatter and Arvid Persson, Ericsson, for their invaluable advice, effort and encouragement,
Alexander Baltatzis and Olle Bälter, KTH, for their support, engagement and for
taking me on as a thesis student,
Christian Skärby, Ericsson, for providing me with insight and help with the simulator
tool,
Dharmendra S. Modha, IBM, for supplying me with test data,
John McCarthy, Ericsson, for giving me the opportunity.
Table of contents

Chapter 1
  Introduction
    Background
    Thesis objective
    Delimitations
    Choice of methodology
Chapter 2
  Memory management
    The operating system
    The role of a cache memory
    Memory access patterns
    Bélády's algorithm
  Caching algorithms
    Conventional replacement algorithms
    Contemporary caching algorithms
  Related research fields
Chapter 3
  Implementation
    Preliminary process outline
    Choice of algorithms to implement
    Traffic scenarios and simulated input
    The prototype model
    Code verification
  Results
  Performance analysis
  Resource requirements
  Incorporated network statistics
Chapter 4
  The iterative design of an intelligent data caching algorithm
    A simple heuristic based on metric values
    Performance results of HBM
    Combining HBM with LRU
    Combining HBM with ARC
    Three-Queue-Metric
  Latency calculations
  A comparison to the optimal solution
Chapter 5
  Conclusion
    Discussion
    Recommendations
    Future work
Bibliography
Chapter 1
Chapter 1 serves as an introduction to the degree project where the
background to the problem and overall aim of the thesis report are
presented. It also discusses delimitations and choice of methodology to
accomplish the project goal.
Introduction
This degree project aims to explore the concept of intelligent data caching attained through the use of network statistics available in the operating system of base stations. Caching techniques are a valuable tool for most memory management systems, but this study considers only the telecommunication context. The degree project is carried out at Ericsson AB, a Swedish telecom company with worldwide presence. The company provides operators with network services and supplies the global market with mobile technology (Ericsson AB, 2010).
The amount of user data that a base station manages is too large for the operating system to store in its internal on-chip memory, so a slower secondary memory is required. Through the use of a cache replacement policy, Ericsson foresees that bandwidth savings can be achieved and latency (time delay) thereby reduced. The task of the degree project is to investigate whether network statistics can be utilized for this purpose and to suggest a caching algorithm that performs well on a workload corresponding to that of a Long Term Evolution (LTE) base station.
Background
The thesis concerns LTE technology, a global standard for wireless communication in the telecom industry. The first network was launched in December 2009 by TeliaSonera and Ericsson (Ericsson AB, 2012). LTE is colloquially also referred to as 4G, although the formal performance criterion for 4G of 1 Gbit/s downlink peak data rate for low mobility (Lte World, 2009), established by the International Telecommunication Union (ITU) (International Telecommunication Union, 2012), is not fulfilled for end customers as of today, March 2012. The enhancement of LTE, LTE-Advanced, is Ericsson's response to bridge the gap in performance and increase network capacity.
In an LTE network, each base station has a scheduler that is in charge of scheduling the current users of that site such that the resources are fairly distributed among the users. User data is sent from the base station in time slots on the radio link called Transmission Time Intervals (TTI). A user context holds the information about a specific user that is needed when that user is scheduled to send. The user contexts for all users scheduled in the next TTI are fetched from external memory into internal memory regardless of the previous memory content, which means that no caching technique is applied.
Although it is not performance critical today, this sub-optimal solution becomes an issue as the company has set out to increase network capacity and the number of scheduled users per unit of time. Latency between off-chip and on-chip memory limits the amount of data that can be fetched for use in a single TTI. The goal is to implement a caching algorithm that saves as much bandwidth as possible in the given environment, which means a cache replacement policy with the highest possible cache hit ratio. Cache hit ratio is defined as the number of cache hits divided by the total number of memory requests, as expressed below.
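In formula form:

\[
\text{hit ratio} = \frac{\text{number of cache hits}}{\text{total number of memory requests}}
\]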
One approach is to compare the performance of currently known replacement policies and choose the best suited. However, scheduling decisions depend, for instance, on the amount of data in the user buffers, the current conditions of the radio link and other scheduling parameters. It is not necessarily the most recently used or the most frequently used data that will soon be used again. This means that a conventional cache replacement algorithm that does not accommodate the specific characteristics of a telecom network may not be optimal. A replacement algorithm processing a priori statistics extracted from the network could potentially increase cache efficiency, i.e. the cache hit ratio, compared to the other algorithms.
Thesis objective
This degree project thus intends to investigate whether bandwidth savings between on-chip and off-chip memory can be achieved using a priori statistics when evicting cached data. This is determined, first, by implementing and evaluating carefully selected algorithms and, second, by implementing potentially improved algorithms that process appropriate statistics and comparing all of these against each other.
An application is built in order to handle the input as well as the presentation of the
resulting output. All algorithms and the application are written in C++, a language without a garbage collector, which facilitates explicit control of memory allocation. Post-processing, parsing of data from the simulator, and the graphical presentation of algorithm outputs are handled by Matlab scripts. Matlab is
a tool designed to process and present large amounts of data, which makes it a compelling choice for this project. In addition, the simulator of LTE traffic already provides data in Matlab file format.
The ultimate objective is to be able to suggest an appropriate cache replacement algorithm empirically proven to suit the intended environment of an LTE telecom network. The degree project must also consider and comment on the potential cost in time complexity of the investigated replacement algorithms, both compared to the current solution and relative to each other. Both memory and processor resources are limiting factors in a base station.
The degree project also includes research of memory management systems and
replacement algorithms in order to propose and select relevant algorithms to be further
investigated.
Delimitations
The number and choice of algorithms to implement depends on the results of the pre-study, but also on the time frame of the project. Example traffic data from a simulator is provided by Ericsson and used as input to the algorithms. The sample of relevant statistics to be utilized by the algorithms is selected in consultation with Ericsson.

The evaluation and comparison of caching algorithms is limited to a number of predetermined traffic scenarios, which account for the possibility that the performance of the selected algorithms could vary in different traffic situations. These traffic scenarios are also determined in consultation with Ericsson.
The conclusions of this degree project assume that both the simulated traffic data and
the traffic scenarios provided by Ericsson are realistic and representative of traffic in the
company’s live LTE network.
Choice of methodology
The conclusions and recommendations of this thesis will be based on a scientific study
using an empirical and experimental approach to achieve scientific validity. The
implementation process of the heuristic-based algorithm will be iterative to allow for
continuous improvements. At the first stage of the project a state-of-the-art evaluation is
performed to be able to take advantage of current research in the field of replacement
algorithms. The results are then compared using a quantitative evaluation method, i.e. an assessment process that compares discrete numerical values.
The outputs of the algorithms, which are visualized as graphs, are the calculated hit ratios as a function of cache size n. Finally, latency per TTI is calculated and presented as the percentage of cache misses in one TTI.
Chapter 2
Chapter 2 intends to lay the foundation of the theory that is essential in
order to understand the problem that this degree project aims to explore.
This includes basic theory about memory management systems, cache
memories and a state-of-the-art evaluation of replacement algorithms with
varying degree of complexity.
Memory management
The operating system
An operating system is the layer of software closest to the hardware and runs in kernel mode, as opposed to programs running in user mode. The operating system serves higher-level software with an abstract set of resources and manages access to the hardware on its own. It has the permission to execute any machine instruction (Tanenbaum, 2009). The layered communication is portrayed in Figure 1. Hardware components that the operating system manages include the CPU, memory and I/O devices (Tanenbaum, 2009).
Figure 1: The standard layered structure of communication of a computer architecture.
A process is a key concept enabling several programs to run simultaneously, and the operating system is responsible for scheduling all processes for access to the processor such that each program receives a fair share of the available resources. The constant switching between processes is what gives the appearance of running all processes simultaneously. While a process is scheduled for processor access, it must also be allocated a sufficient amount of memory (Tanenbaum, 2009).
The role of a cache memory
To compromise between the requirements of a memory that is fast, affordable and sufficiently large to store the desired data, the memory system is most often structured according to a layered hierarchy (Frick, 2009). The further away from the CPU, the slower and cheaper the memory becomes. The fastest storage is hence the CPU registers, with essentially no latency but with less than 1 KB of storage. The next layer is usually a high-speed cache, still relatively small in size, and so forth (Frick, 2009). This hierarchy is illustrated in Figure 2.
so forth (Frick, 2009). This hierarchy is illustrated in Figure 2.
Figure 2: A typical memory hierarchy in a memory management system (Frick, 2009).
The idea is that not all information is needed at the same time. Data is fetched on demand from lower levels to higher levels unless it is already located in the higher level. The latter case is called a cache hit and entails significant time savings. Data that will soon be used again should therefore be kept in the cache for better efficiency.
Memory access patterns
A cache replacement algorithm is a policy that decides what to evict when the cache is
full and needs to make room for new data. The suitable replacement algorithm for a
specific implementation depends on the access pattern of the requests of data from
memory. Replacement algorithms are mostly evaluated based on input reflecting
different memory usage patterns, and it has been demonstrated empirically that a set of
the most well-known algorithms perform differently depending on the underlying
pattern of requests (Paajanen, 2007). The evaluation is done by comparing hit ratio as a
function of cache size n (Megiddo & Modha, 2003). Some access patterns indicate for
example, locality in time, such as a loop pattern where the same data is needed in every cycle of the loop.
Figure 3: A definition of two common access patterns that can pollute the cache. A scan is a long
sequence of one-time data requests. Thrashing occurs when there is a loop pattern but the number
of elements in the loop is larger than the cache size (Jaleel, Theobald, Steely Jr., & Emer, 2010).
A loop pattern where the number of items in the loop exceeds the cache size is called a thrashing access pattern (Jaleel, Theobald, Steely Jr., & Emer, 2010), defined in Figure 3a, due to its ability to pollute the cache: any item in the loop is evicted before it is used again. A scan, also called a stream, on the other hand, shows no locality in time but likewise has the ability to pollute the cache. The pattern, defined in Figure 3b (Jaleel, Theobald, Steely Jr., & Emer, 2010), refers to a long sequence of one-time data requests that replace the current and desired content of the cache with data that will not be requested again. A scan-resistant replacement algorithm is one that is resilient to this behavior and does not let the cache get polluted by scanning patterns.
Another example of an access pattern is the correlated pattern (Paajanen, 2007). It
assumes that the same data is requested from memory twice within a short time frame
after which it is not requested again for a long period of time and does not need to
remain in the cache.
Bélády’s algorithm
The optimal algorithm for any access pattern, as defined by Bélády, is one that always evicts the cached data whose next access lies furthest away in time compared to the other content in the cache (Paajanen, 2007). Such an algorithm is only possible if the future data requests are completely predictable. This is most often not the case, if ever, but Bélády's algorithm can be used for reference calculations (Paajanen, 2007). Replacement algorithms can be evaluated by comparing their performance with how Bélády's algorithm would have performed in the same situation. This performance is measured after the algorithms have run, when the data requests that occurred are known.
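The following is a minimal C++ sketch of how such a reference calculation can be simulated offline once the full request trace is known; the naive inner search makes it slow, but it suffices as a sketch. Function and variable names are illustrative and not taken from the thesis implementation.

#include <cstddef>
#include <unordered_set>
#include <vector>

// Offline simulation of Bélády's optimal policy on a known trace.
// On a miss with a full cache, the cached item whose next use lies
// furthest in the future (or never occurs) is evicted. The naive
// next-use search makes this quadratic in the trace length, which
// is acceptable for reference calculations only.
std::size_t simulateBelady(const std::vector<int>& trace, std::size_t capacity) {
    std::unordered_set<int> cache;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        if (cache.count(trace[i])) { ++hits; continue; }
        if (cache.size() == capacity) {
            int victim = 0;
            std::size_t furthest = 0;
            for (int item : cache) {
                std::size_t next = trace.size();       // "never used again"
                for (std::size_t j = i + 1; j < trace.size(); ++j)
                    if (trace[j] == item) { next = j; break; }
                if (next >= furthest) { furthest = next; victim = item; }
            }
            cache.erase(victim);
        }
        cache.insert(trace[i]);
    }
    return hits;
}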
Caching algorithms
Conventional replacement algorithms
Simple queue-based policies
The most commonly recognized replacement algorithm is the Least Recently Used (LRU) algorithm, illustrated in Figure 4. It always evicts the cached data that has not been used for the longest time (Megiddo & Modha, 2003) and performs reasonably well on many different types of workloads.

The Least Frequently Used (LFU) algorithm is also a straightforward policy, with the aim of capturing frequency instead of recency. The policy maintains a frequency count for each cached item and evicts the item with the lowest count. The commonly known First-In-First-Out (FIFO) replacement policy always evicts the oldest data, regardless of when it was last used (Paajanen, 2007). FIFO is not widely implemented as it has been shown to perform significantly worse than LRU for most memory management systems (Paajanen, 2007). The queue structure is, however, easier to implement and has lower overhead than LRU, since there is no constant need to move data around inside the cache. A drawback of both LRU and FIFO is that they are not scan resistant.
Figure 4: An illustration of the LRU replacement policy to the left, and the FIFO replacement
policy to the right. According to the LRU policy an item is moved back to the top of the queue upon
an access request.
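As a minimal sketch of how LRU can be implemented with constant-time operations, the following C++ fragment uses the same structure as the prototype described in Chapter 3: a hash table mapped onto a doubly linked list. Integer keys and the class name are assumptions for illustration.

#include <cstddef>
#include <list>
#include <unordered_map>

// Minimal LRU cache sketch. The hash table maps each key to its node
// in a doubly linked list kept in recency order (most recently used
// at the front), making lookup, update and replace all O(1).
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true on a cache hit; on a miss the LRU item is evicted.
    bool access(int key) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            queue_.splice(queue_.begin(), queue_, it->second); // move to front
            return true;
        }
        if (queue_.size() == capacity_) {
            index_.erase(queue_.back());                       // evict LRU item
            queue_.pop_back();
        }
        queue_.push_front(key);
        index_[key] = queue_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<int> queue_;                                     // MRU front, LRU back
    std::unordered_map<int, std::list<int>::iterator> index_;
};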
Sequence-based
The Sequence-based (SEQ) algorithm was proposed as an alternative to LRU (Paajanen, 2007). Most of the time it functions as an LRU policy, but it keeps track of cache misses in order to detect long sequences of them. If such a sequence is detected, the algorithm switches to a pseudo-MRU (Most Recently Used) policy. The idea is thus to address the poor performance of LRU on scan access patterns while otherwise functioning as the LRU policy normally does (Paajanen, 2007), thereby improving the scan resistance of LRU.
CLOCK
The basic model of the CLOCK algorithm is visible in Figure 5 (Paajanen, 2007). Cached pages are kept in a circular list where the "clock pointer" points at the oldest item in the list, and each page item has a referenced bit. When an item needs to be evicted, the algorithm searches for a page with its referenced bit set to 0. It starts by incrementing the clock pointer; if the current item has its referenced bit set to 1, the bit is reset to 0 and the pointer moves one step in the clock circle, repeating these steps until an item that can be replaced is found. The algorithm has low overhead as it does not move items between different queues but simply maintains the circle and moves the clock pointer. CLOCK is not scan resistant, and it is shown later in this report that it performs similarly to LRU.
Figure 5: The CLOCK model of a cache replacement algorithm where the clock pointer searches
for an item with its referenced bit not set.
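Under the same illustrative assumptions as the LRU sketch above, the victim search of CLOCK can be written as follows; the vector plays the role of the circular list.

#include <cstddef>
#include <vector>

// Sketch of the CLOCK victim search. Each slot holds a key and a
// referenced bit; `hand` is the clock pointer. A set bit is cleared
// (a second chance) and the hand advances until an unset bit is found.
struct ClockSlot { int key; bool referenced; };

std::size_t findClockVictim(std::vector<ClockSlot>& slots, std::size_t& hand) {
    for (;;) {
        ClockSlot& slot = slots[hand];
        if (!slot.referenced) {
            std::size_t victim = hand;
            hand = (hand + 1) % slots.size();   // leave the hand past the victim
            return victim;
        }
        slot.referenced = false;                // second chance
        hand = (hand + 1) % slots.size();
    }
}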
Contemporary caching algorithms
Adaptive Replacement Cache
The Adaptive Replacement Cache algorithm (ARC) is based on the idea of combining two policies in one, self-tuning between the two (Megiddo & Modha, 2003). The algorithm, developed by IBM researchers Modha and Megiddo, is state of the art and acclaimed to perform better than both LRU and LFU. It addresses both recency and frequency while also providing scan resistance. The advantage of the ARC algorithm, compared to other replacement algorithms based on similar ideas, is its low computational overhead, which is independent of cache size (Megiddo & Modha, 2003).
ARC maintains two different lists, L1 and L2. L1 contains pages that have been
referenced only once whereas L2 contains pages that have been referenced at least
twice. The data structure of ARC is depicted in Figure 6.
Figure 6: The implementation structure of ARC where cached items are contents of T1 and T2. B1
and B2 keep references of evicted content to be able to adapt to the current access pattern.
The self-tuning property lies in the number of cached entries taken from each list. If the total size of the cache is c, then both L1 and L2 hold references to c entries each, which means that the total capacity of the combined lists is 2c. The algorithm chooses and adjusts the number of cached entries from each list. When the algorithm detects that most of the cache hits come from, for example, L1, it automatically adapts to the environment by caching more pages from that list and consequently fewer pages from L2 (Megiddo & Modha, 2003).
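The adaptation step can be sketched as follows. The fragment shows only the adjustment of the target size p for T1 on hits in the ghost lists B1 and B2, following the rule in Megiddo and Modha (2003); the surrounding replacement logic is omitted, and the function name is an assumption.

#include <algorithm>
#include <cstddef>

// Sketch of ARC's self-tuning rule (Megiddo & Modha, 2003).
// p is the target number of slots devoted to T1 (recency), c is the
// cache size, and b1/b2 are the current ghost-list sizes. A hit in
// B1 means T1 was too small, so p grows; a hit in B2 shrinks it.
void adaptOnGhostHit(bool hitInB1, std::size_t& p, std::size_t c,
                     std::size_t b1, std::size_t b2) {
    if (hitInB1) {
        std::size_t delta = (b1 >= b2) ? 1 : b2 / b1;
        p = std::min(p + delta, c);
    } else {
        std::size_t delta = (b2 >= b1) ? 1 : b1 / b2;
        p = (p > delta) ? p - delta : 0;
    }
}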
Two Queue
The Two Queue (2Q) algorithm also tries to improve on the qualities of LRU by maintaining two different lists (Paajanen, 2007). One list is implemented as an ordinary LRU list and the other as a FIFO list. The FIFO list is further divided into one Fin list and one Fout list, where the latter contains only reference information. Operations are visualized in Figure 7 (Paajanen, 2007).
Figure 7: The operation policy of the 2Q cache replacement algorithm. Contents of the LRU queue
and Fin are cached whereas Fout keeps references of previous content of Fin.
When a page is first accessed it is inserted at the top of the Fin list. When the Fin list becomes full, the last page in Fin is flushed and a reference to it is added to the Fout list; the page is thus no longer cached but information about it is kept. If a page that is referenced from the Fout list is accessed, it is cached again at the top of the LRU list. If space becomes available in Fin, a cached page is moved there from the end of the LRU list. The purpose of this algorithm is to provide a scan-resistant alternative to the ordinary LRU policy, where pages referenced only once are quickly removed from the cache and only truly "hot" pages are kept in the LRU list. The LRU list is larger in size than Fin, about 3/4 of the total cache size (Paajanen, 2007).
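A compact sketch of the 2Q access path is given below, assuming integer keys. The linear membership tests stand in for the hash-table indexing a real implementation would use, and the handling of space freed in Fin is omitted; both are illustrative simplifications.

#include <algorithm>
#include <cstddef>
#include <deque>

// Sketch of the 2Q access path. fin and lru hold cached keys; fout
// holds references only. Queue fronts are the "top" of each queue.
struct TwoQ {
    std::deque<int> fin, fout, lru;
    std::size_t finCap, foutCap, lruCap;

    void access(int key) {
        if (removeIfPresent(lru, key)) {
            lru.push_front(key);                  // hot page: refresh recency
        } else if (removeIfPresent(fout, key)) {  // re-referenced: promote
            if (lru.size() == lruCap) lru.pop_back();
            lru.push_front(key);
        } else if (std::find(fin.begin(), fin.end(), key) == fin.end()) {
            if (fin.size() == finCap) {           // flush oldest Fin page
                fout.push_front(fin.back());      // keep it as a reference
                if (fout.size() > foutCap) fout.pop_back();
                fin.pop_back();
            }
            fin.push_front(key);                  // first access enters Fin
        }
    }

    static bool removeIfPresent(std::deque<int>& q, int key) {
        auto it = std::find(q.begin(), q.end(), key);
        if (it == q.end()) return false;
        q.erase(it);
        return true;
    }
};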
Second Chance-Frequency Least Recently Used
Second Chance-Frequency LRU (SF-LRU) combines the LFU and LRU policies (Alghazo, Akaaboune, & Botros, 2004). When a page is to be evicted from the cache, an LRU policy is used to select the least recently used page. The next step is to compare this page's frequency value, obtained by LFU calculations, with that of the second least recently used page, as illustrated in Figure 8 (Alghazo, Akaaboune, & Botros, 2004). The page that is evicted is the one of the two that has the lowest frequency value. If a page is saved by its frequency value, this value is reset: the page has been given its second chance and will not be saved again (Alghazo, Akaaboune, & Botros, 2004). The algorithm is not scan resistant.
Figure 8: The operation policy of the SF-LRU cache replacement algorithm. A second chance is
given to the last item in the LRU queue by comparing its frequency value to that of the second-last
item.
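The eviction decision can be sketched in a few lines of C++; the entry fields and the function name are illustrative assumptions.

#include <iterator>
#include <list>

struct Entry { int key; unsigned freq; };

// Sketch of the SF-LRU eviction decision (Alghazo et al., 2004).
// The least recently used entry (at the back) is saved if its
// frequency count beats that of the second least recently used
// entry; its count is then reset so it is not saved twice in a
// row. Requires at least two entries in the queue.
int evictSfLru(std::list<Entry>& queue) {
    auto last = std::prev(queue.end());         // least recently used
    auto secondLast = std::prev(last);
    if (last->freq > secondLast->freq) {
        int victim = secondLast->key;
        last->freq = 0;                         // second chance granted once
        queue.erase(secondLast);
        return victim;
    }
    int victim = last->key;
    queue.erase(last);
    return victim;
}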
CLOCK-Pro
The CLOCK-Pro algorithm implements the CLOCK model but with a different approach to marking which pages are candidates for eviction (Jiang, Chen, & Zhang, 2005). CLOCK-Pro calculates the "reuse distance" for each page, which is the number of times any other page has been accessed since this specific page was last accessed. This distance is used to determine if the page is "hot" or "cold". Once a page
is brought into the cache or degraded from hot to cold, it is marked as cold and given a test period in which to prove its importance (Jiang, Chen, & Zhang, 2005).
Figure 9: The image from Jiang, Chen and Zhang (2005) shows the CLOCK-Pro cache replacement model. Hot pages are marked with "H" and cold pages with "C". The check marks represent reference bits set to 1.
Cold pages are eligible for eviction by the clock pointer, but if a cold page is evicted during its test period, the reference data is kept in the clock until the end of the period. If a cold page is accessed during its test period, it is re-marked as hot; if a test period ends without any memory access, the page becomes eligible for eviction without a new test period.
To maintain these markings and keep them up to date, the CLOCK-Pro algorithm includes three different clock pointers in the clock circle: the test, hot and cold pointers, which all have different marking and incrementing rules. The modified CLOCK-Pro model is portrayed in Figure 9 (Jiang, Chen, & Zhang, 2005).
Dueling CLOCK
The Dueling CLOCK (DC) policy was designed to alternate between a CLOCK model and a scan-resistant version of the CLOCK model (Janapsatya, Ignjatovic, Peddersen, & Parameswaran, 2010). To add scan resistance to the CLOCK model, the "clock pointer" is not incremented before searching for a page with its referenced bit set to 0, which means that the most recently replaced data is also eligible for eviction. The DC algorithm self-tunes, adapting dynamically to the current behavior by alternating between these two models (Janapsatya, Ignjatovic, Peddersen, & Parameswaran, 2010).
Figure 10: The two CLOCK models that the CAR algorithm maintains. The image is from Bansal and Modha (2004). Additional characteristics of the algorithm are very similar to ARC.
CLOCK with Adaptive Replacement
CLOCK with Adaptive Replacement (CAR) is also a state-of-the-art replacement policy, inspired by ARC (Bansal & Modha, 2004). CAR maintains two "clocks", shown in Figure 10: T1, which captures recency as in the original model, and T2, which instead captures frequency but otherwise functions the same. Similar to ARC, two lists B1 and B2 are introduced to maintain the history information of the previous lists. B1 and B2 do not contain data content, only reference information, and these lists also have a replacement policy of their own. The algorithm uses B1 and B2 to dynamically adapt the sizes of T1 and T2. The implementation is therefore scan resistant, low overhead and self-tuning, again similarly to ARC (Bansal & Modha, 2004).
CAR with Temporal filtering (CART) is an additional enhancement of the CAR algorithm, with the objective of being resistant to pollution from the correlated access pattern. The temporal filter is added as a stricter rule for a page to advance from the recency clock to the frequency clock: two memory accesses within a short period of time are not enough for a page to be considered frequent. CART therefore has all of the advantages of the CAR algorithm but with an extra improvement that neither CAR nor ARC incorporates (Bansal & Modha, 2004).
Related research fields
Research on caching techniques for traditional memory management systems, with a
memory hierarchy analogous to Figure 2, has been ongoing for many years. However,
recent advancements in mobile technology have boosted demand for research in new
areas of application. Ye, Li and Chen, for instance, state that "Recently, multimedia data caching is getting more attention" (Ye, Li, & Chen, 2007). The environment refers to data caching between streaming servers and base stations, which becomes necessary as people consume and stream ever more media content through their mobile devices. Data traffic in mobile networks has grown rapidly over the past decade, and one can therefore presume that an even more pressing demand for innovative caching solutions will arise in the future.
Chapter 3
Chapter 3 accounts for the first implementation phase of the degree project. The development process, the prototype model and the obtained performance results are described. The implemented algorithms are validated, and network statistics are chosen to suggest how statistics can be incorporated in a caching policy.
Implementation
Preliminary process outline
The implementation process model that the degree project follows is outlined in Figure 11. The iterative development process is highlighted and allows for continuous improvements and evaluation of new ideas. Once this process is completed, the final results are compared to Bélády's algorithm to see how they differ from the optimal solution.
Figure 11: An iterative implementation process model of the project task.
Choice of algorithms to implement
The selection of algorithms to evaluate in the first step of the process can be seen in Figure 12. They were chosen as a result of the pre-study, either to facilitate the detection of access patterns or to evaluate current research in the field.
Figure 12: Algorithms chosen for the empirical study.
Traffic scenarios and simulated input
The evaluation of the implemented replacement algorithms is based on the four
different traffic scenarios, or user types, seen in Figure 13. FTP stands for File Transfer
Protocol. The scenarios are simulated for a base station with three cells that hold a large
number of simultaneous users. A cell with only a few simultaneous users is not relevant
to this project as latency between on-chip and off-chip memory is presumed to be
performance critical only for a large number of simultaneous users. The mixed scenario
that includes all different user types is most important for the final performance
evaluation as it emulates a realistic scenario. The proportion of different user types in
the mixed scenario is 60% Web browsing, 20% VoIP and 20% FTP download.
Figure 13: User types in a telecom network that are simulated as different traffic scenarios.
The prototype model
The application is implemented according to the model seen in Figure 14. The main application processes the requests in the TTIs according to the chosen cache strategies, which can be added or removed on demand. A hash table is also mapped onto the doubly linked list such that each access to an item is constant in time.
Figure 14: Basic model of the implemented application and its interaction with the algorithms.
Code verification
The final implementation code for all algorithms is reviewed by Ericsson personnel in order to ensure that the algorithms are correctly implemented according to their respective pseudocode.
In addition, LRU, ARC, CLOCK and CAR are verified against the test data workloads provided by IBM (Megiddo & Modha, 2003). These workloads are further explained in the research paper ARC: A Self-Tuning, Low Overhead Replacement Cache (Megiddo & Modha, 2003). The tested workloads are P3, P4, P5, P6 and P8. Figure 15 and Figure 16 are examples of the compared graphs for workload P6.

The correct curves are obtained for all four algorithms, and single data points in the graphs differ in value by at most 0.4 percentage points. The implemented algorithms and the application can therefore be considered validated, with no significant logical or practical error, according to the requirements of both KTH and Ericsson. Figure 15 and Figure 16 also emphasize the similarities in performance of ARC and CAR as well as of LRU and CLOCK.
Figure 15: Code verification. Hit ratio results on workload P6 of the implemented versions of CAR
and CLOCK are shown to the left. The results correspond with the official graph of (Bansal &
Modha, 2004) to the right.
Figure 16: Code verification. Hit ratio results on workload P6 of the implemented versions of ARC
and LRU are shown to the left. The results correspond with the official graphs of (Megiddo &
Modha, 2003) to the right.
Results
The output graphs for all scenarios are presented in Figures 17-24. The graphs of the seven algorithms are divided into two groups for convenience.

Please note that algorithms that tend to perform similarly also produce similar graphs, and these curves can be difficult to separate from one another in the figures below. The main intention in presenting the graphs, however, is to show which curves do or do not differ from each other, not to convey exact numerical values.
Please also note that the graph of LFU in Figure 17 does not approach a hit ratio of 100%, which is correct behavior. A disadvantage of LFU is that formerly frequent items that now should be evicted have a much higher frequency count than newly added items and therefore remain cached unnecessarily long. In the VoIP scenario new users are added at a higher rate than in the other scenarios, so LFU suffers from this disadvantage, which is why the graph does not reach 100%. The number of active users in the system remains constant, but new users are added as old users leave.
Figure 17: Random, FIFO and LFU hit ratios for the VoIP traffic scenario. FIFO suffers from
thrashing for small cache sizes.
Figure 18: LRU, ARC, CLOCK and CAR hit ratios for the VoIP traffic scenario. LRU and
CLOCK suffer from thrashing for small cache sizes. ARC and CAR demonstrate unstable
behavior.
Figure 19: Random, FIFO and LFU hit ratios for the Web traffic scenario.
Figure 20: LRU, ARC, CLOCK and CAR hit ratios for the Web traffic scenario.
Figure 21: Random, FIFO and LFU hit ratios for the FTP traffic scenario.
Figure 22: LRU, ARC, CLOCK and CAR hit ratios for the FTP traffic scenario.
Figure 23: Random, FIFO and LFU hit ratios for the Mix traffic scenario.
Figure 24: LRU, ARC, CLOCK and CAR hit ratios for the Mix traffic scenario.
Performance analysis
The results indeed imply that the access patterns vary between the different scenarios. It is important to find an algorithm that performs well for all scenarios, especially the Mix, as all user types are present in the telecom network.

It is evident that ARC and CAR show similar and mostly good performance but lack stability. Figure 18, for example, indicates that the algorithms occasionally suffer from thrashing on the specific workload, but only for some cache sizes. This behavior is not desired in the intended environment. It is essential that the algorithm behavior is somewhat predictable, in the sense that performance should not worsen with a larger cache size.
The graphs also show that LRU, CLOCK and FIFO suffer from thrashing for smaller cache sizes on the VoIP pattern. In general, the VoIP traffic scenario seems to be more difficult for a caching policy to manage, whereas Web, for instance, is much easier.

The analysis concludes that, at this point, Random would be the best choice of an algorithm that is stable and relatively good for all scenarios. CLOCK and LRU are also good alternatives despite their poor performance on VoIP. LFU demonstrates bad performance on all scenarios except VoIP, where LRU is much worse, which stresses that a well-customized algorithm should in fact address both recency and frequency to accommodate all access patterns. As mentioned above, ARC and CAR do demonstrate good performance, but their instability is not acceptable for the intended purpose.
Resource requirements
Time complexity of the algorithms will depend slightly on design choices of the
implementations. It is not certain that these specific implementations of the algorithms
are applicable to the environment of the operating system in the base stations. Ericsson
is therefore advised to also, if relevant, analyze the algorithms with respect to the
possibilities and limitations of their own software and hardware.
Note that it is not critical that the prototype implementation is optimal, as the aim is to evaluate the output of the algorithms, not to measure running time. Nonetheless, in this implementation most algorithms use a hash table mapped onto a doubly linked list, such that checking for a cache hit and accessing an item is constant in time, at the cost of the additional hash table data structure. Time complexities and memory requirements for all seven algorithms are found in Table 1.
Algorithm   lookup   update      replace     Memory requirement
LRU         O(1)     O(1)        O(1)        c
LFU         O(1)     O(n log n)  O(n log n)  c
CLOCK       O(1)     O(1)        O(n)        c
ARC         O(1)     O(1)        O(1)        2c
CAR         O(1)     O(1)        O(n)        2c
FIFO        O(1)     O(1)        O(1)        c
Random      O(1)     O(1)        O(n)        c

Table 1: Time complexities of the three main methods (lookup, update, replace) and memory requirements for cache size c.
CLOCK and CAR have a worst-case time complexity for replace of O(n) due to the clock search for an unset reference bit. Random has the same complexity for replace as CLOCK and CAR because it iterates to the randomly generated position in the list. LFU has a theoretical update and replace time complexity of O(log n) due to heap traversal, but the current implementation's conversion between iterators and positions leads to an actual complexity of O(n log n).
ARC, on the other hand, is constant with respect to cache size, but it must be emphasized that it requires more clock cycles than the other algorithms, as more operations need to be performed. ARC and CAR, like 2Q and CLOCK-Pro, also require more memory due to keeping track of history information. These are clear disadvantages of ARC and CAR, as both memory and processing resources in the base stations are limited.
Incorporated network statistics
The algorithms that incorporate network statistics employ so-called metric values for each user, which are updated every TTI. It is not disclosed in this thesis report how the metric values are calculated or on which scheduling parameters they depend; they combine various aspects such as the user type, reuse distance and other radio conditions. These values can, however, be regarded as weights that are hypothetically assumed to increase as the probability of being scheduled increases. The degree project determines whether this hypothetical relation is supported or not by the results of the algorithms.
Chapter 4
Chapter 4 describes the second implementation phase, in which the replacement algorithms utilize network statistics with the intention of customizing for the telecom environment. A new algorithm named Three-Queue-Metric is designed. Results are compared to some of the previous algorithms and ultimately to the optimal solution.
The iterative design of an intelligent data
caching algorithm
A simple heuristic based on metric values
The first algorithm to be evaluated uses the heuristic of evicting the cached user context with the lowest probability of being scheduled according to the hypothesis, which states that the user context with the lowest metric value has the lowest probability of being scheduled. If more than one context shares the same metric value, one is chosen at random. The algorithm will from here on be called the Heuristic-Based Metric Algorithm (HBM). The eviction step is sketched below.
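A minimal C++ sketch of the HBM eviction step follows, assuming each cached user context carries its current metric value; the field and function names are illustrative, and ties are broken uniformly at random as described above.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Sketch of the HBM eviction heuristic: scan all cached user
// contexts and pick the one with the lowest metric value, breaking
// ties uniformly at random. The full scan makes every replacement
// linear in the number of cached items.
struct CachedContext { int userId; double metric; };

std::size_t selectHbmVictim(const std::vector<CachedContext>& cache) {
    std::size_t victim = 0;
    std::size_t ties = 1;
    for (std::size_t i = 1; i < cache.size(); ++i) {
        if (cache[i].metric < cache[victim].metric) {
            victim = i;
            ties = 1;
        } else if (cache[i].metric == cache[victim].metric) {
            ++ties;                                  // reservoir sampling keeps
            if (std::rand() % ties == 0) victim = i; // each tied item with
        }                                            // probability 1/ties
    }
    return victim;
}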
Results for the four traffic scenarios are shown in Figures 25-28. It is clear that HBM is stable and performs well on workloads corresponding to all scenarios, although it is not the best one in every scenario. It follows the same curve shape as Random, but with improved performance, and it can hence be deduced that the metric values do correlate with the probability of future scheduling decisions; otherwise HBM would be no better than pure randomness. This leaves room for possible improvements. As discussed previously, a well-suited algorithm should also attempt to capture both recency and frequency for better performance, which will have to be incorporated. Another disadvantage is the time complexity: every replace operation traverses all cached items and is therefore linear, not only in the worst case.
Analysis of the results thus concludes that although the algorithm demonstrates good
stability and performance, an algorithm with better complexity and/or improved
performance on certain workloads would be preferable. This can perhaps be achieved
by addressing recency and/or frequency better. The next section describes several
unsuccessful attempts that have been made to achieve better results. Although not very
successful, they did lead to valuable reflections that spurred the design of a more
successful attempt that will be discussed later.
From the previous analysis it is assumed that ARC and CAR have similar characteristics, as do CLOCK and LRU. Therefore only ARC and LRU are considered in the following investigation, with the aim of improving HBM with respect to hit ratio and the number of required clock cycles. Random and FIFO are disregarded because HBM already performs better.
Performance results of HBM
Figure 25: Performance of HBM for the VoIP traffic scenario.
Figure 26: Performance of HBM for the Web traffic scenario.
Figure 27: Performance of HBM for the FTP traffic scenario.
Figure 28: Performance of HBM for the Mix traffic scenario.
Combining HBM with LRU
Given the above analysis, it seems plausible that both performance and stability are achievable by incorporating a metric-based heuristic into ARC or LRU. This section concerns LRU. Two new algorithms named LRU-M and M-LRU are designed to combine LRU and HBM; they intend to capture the predictability of metric values as well as recency. The logical models of these algorithms are shown in Figure 29.
Figure 29: LRU-M is illustrated to the left and M-LRU to the right. LRU-M evicts the item with the least metric value from a 10 percent selection of the least-recently-used items.
LRU-M is designed to reduce the number of cached items that have to be inspected for each replacement operation. M-LRU, on the other hand, is designed to test whether the performance of HBM could improve by applying the LRU approach to items sharing the same lowest metric value. Unfortunately, neither LRU-M nor M-LRU results in better overall performance than HBM; results are visible in Figure 31. Some graphs from here on in the report have been omitted due to redundancy or lack of contribution to the analysis.
Figure 30: ARC-M operates similarly to ARC except for the decision policy in T1 and T2.
Figure 31: Performance of LRU-M and M-LRU for the Mix traffic scenario. Neither one shows
better performance than HBM.
Combining HBM with ARC
To combine the advantages of both HBM and ARC, ARC-M is designed according to the model in Figure 30. T1 and T2 are HBM queues instead of LRU queues whenever the ARC policy chooses to replace an item in either one of them; the task of T2 is still to capture frequency.

Furthermore, ARC-MT1 is designed to determine whether the metric-based heuristic is preferable over an LRU queue for T1 alone. The model is visualized in Figure 32: T2 remains an LRU queue whereas T1 is implemented as an HBM queue.
The hit ratios of ARC-M and ARC-MT1, depicted in Figure 33 and Figure 34, show that ARC-MT1, ARC-M and ARC all share similar performance. This means that both newly designed algorithms must also be considered unstable. Previous conclusions suggest that recency and frequency should be addressed, but the new results indicate that the adaptive characteristics also entail instability. This conclusion prompted the attempt to design an algorithm that addresses frequency, recency and metric values separately, in queues of fixed sizes, with the hypothesis that the sizes perhaps do not need to be adaptive.
Figure 32: ARC-MT1 applies a metric-based queue only to T1 whereas T2 remains an LRU queue.
Three-Queue-Metric
Three-Queue-Metric (3QM) is first designed to consist of two queues: one metric-based queue and one LRU queue. ARC is self-adaptive and keeps track of history information; the goal here is instead a more stable algorithm on the specific access patterns, which ARC and its variants proved not to be, so 3QM does not incorporate any adaptive characteristics. As a consequence, the sizes of the queues are constant and have to be tuned.

The problem with this two-queue structure is that items from the HBM queue are added to the LRU queue upon access and are obligated to remain there until they reach the LRU position. This means that the separation between the queues does not function properly, and the results are unsatisfactory.
Inspired by CART, which uses a temporal filter to separate the two clock models (Bansal & Modha, 2004), a filter queue is added between the two queues with the purpose of passing only the truly frequent items on to the LRU queue and sending the other items back to the HBM queue. The operations of 3QM are found in Figure 35. The filter queue is also an LRU queue, and any item in it has to be returned to the HBM queue before it can be considered for eviction. At that stage the new, and therefore more accurate, metric value is used, so the metric value the item had while in the filter does not matter. In order for the filter to fulfill its purpose it is necessary that its size is relatively small. Different sizes of the three queues for a total cache size c were tested, and a good division of the cache size was found to be 50-20-30 percent for the HBM, filter and LRU queues respectively.
Figure 33: Performance of ARC-M and ARC-MT1 for the VoIP traffic scenario.
Figure 34: Performance of ARC-M and ARC-MT1 for the Mix traffic scenario.
Chapter 4
34
Figure 35: Structure and filter operations of the 3QM cache replacement algorithm. An access request in the HBM queue moves the item to the filter queue and, if accessed again, to the LRU queue.
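To make the queue interactions concrete, the following is a speculative C++ sketch of the 3QM promotion path, based only on the description above and Figure 35; the thesis implementation is not fully disclosed, so details such as how the LRU queue demotes items are labeled as assumptions in the comments.

#include <algorithm>
#include <cstddef>
#include <deque>

// Speculative sketch of the 3QM queue movements. Only the promotion
// path is shown; the metric-based victim scan in the HBM queue is
// the one sketched for HBM above. Queue fronts are the "tops".
struct ThreeQM {
    std::deque<int> hbm, filter, lru;
    std::size_t filterCap, lruCap;     // e.g. 20% and 30% of cache size c

    void onAccess(int key) {
        if (removeIfPresent(hbm, key)) {       // re-used once: into the filter
            if (filter.size() == filterCap) {
                hbm.push_front(filter.back()); // filter overflow returns to HBM
                filter.pop_back();             // rather than leaving the cache
            }
            filter.push_front(key);
        } else if (removeIfPresent(filter, key)) { // re-used again: truly hot
            if (lru.size() == lruCap) {
                hbm.push_front(lru.back());    // assumption: demote, not evict
                lru.pop_back();
            }
            lru.push_front(key);
        } else if (removeIfPresent(lru, key)) {
            lru.push_front(key);               // ordinary LRU refresh
        }
    }

    static bool removeIfPresent(std::deque<int>& q, int key) {
        auto it = std::find(q.begin(), q.end(), key);
        if (it == q.end()) return false;
        q.erase(it);
        return true;
    }
};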
The theoretical proof of an optimal partitioning is beyond the scope of this degree
project. It is my belief, however, that the sizes depend on the proportion of different
user types in the network. As discussed, Figures 25-27 indicate that VoIP exhibits
locality in frequency intervals whereas FTP and Web exhibit locality of time and metric
values. The intention of 3QM is to separate these different users according to Figure 36
such that VoIP users are moved to the LRU queue while the other users remain in the
HBM queue.
Figure 36: Logical idea of the 3QM cache replacement algorithm. Frequency is addressed by allowing frequent items to be inserted into the LRU queue to the right.
The result of 3QM compared to HBM is presented in Figures 37-40. 3QM performs better than HBM in almost all areas of the graphs while demonstrating better stability than ARC and CAR. As intended, the best increase in performance compared to HBM is for VoIP users. Moreover, although the algorithm has the same theoretical, linear time complexity as HBM, it only has to traverse the cached items of the HBM queue, half of the cache, for every replacement operation and therefore halves the number of required operations. Additionally, there is no need for extra history information because the sizes are not adapted online, which minimizes memory consumption; instead, the queue sizes have been tuned offline to conform to the intended environment.
Figure 37: Performance of 3QM for the VoIP traffic scenario. 3QM performs better than HBM and
LRU.
Figure 38: Performance of 3QM for the Web traffic scenario.
Figure 39: Performance of 3QM for the FTP traffic scenario.
Figure 40: Performance of 3QM for the Mix traffic scenario.
3QM is scan resistant as one-time request items do not advance to the next queues and
are therefore eligible for eviction immediately after they have been added to the cache.
Latency calculations
The total latency for each caching algorithm and cache size n is proportional to the cache miss ratio, and it therefore decreases as the hit ratio increases. The graphs visible in Figure 41, on the other hand, show the miss ratio per TTI. It should not be assumed that a total cache hit ratio of 30 percent entails a cache hit ratio of 30 percent for each TTI. For convenience, only every 400th TTI is marked on the graphs, and the graphs do tend to fluctuate over a considerably large range of values. It is therefore emphasized that there is no guarantee that there will not be any overflow from one TTI to the next, but rather that the total latency is minimized to a certain value.
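For clarity, the quantity plotted in Figure 41 can be computed per TTI as below; the function name is illustrative.

#include <cstddef>

// Miss ratio for a single TTI: misses among the user contexts
// requested in that TTI divided by the number of requests. An
// average hit ratio of 30% says nothing about any individual TTI.
double missRatioPerTti(std::size_t missesInTti, std::size_t requestsInTti) {
    return requestsInTti == 0
               ? 0.0
               : static_cast<double>(missesInTti) / requestsInTti;
}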
Figure 41: Final latency calculations for the Mix traffic scenario shown as miss ratio per TTI. The
fluctuations emphasize the variation of hit ratio from one TTI to another. Hit ratio results are only
a measure of average performance.
A comparison to the optimal solution
Figure 42 shows the performance of 3QM in comparison to Bélády’s algorithm. As
discussed previously, the optimal caching algorithm is not realistically achievable if the
sequence of requests is unknown. Therefore 3QM can be considered to perform
relatively well despite the gap between the two graphs.
Figure 42: A comparison of the 3QM cache replacement algorithm to Bélády’s optimal algorithm.
Chapter 5
Chapter 5 is the last chapter and concludes the degree project with a
discussion of the results obtained throughout the course of the study and
recommendations for Ericsson.
Conclusion
Discussion
This degree project empirically shows that network statistics in the operating system of a base station can be utilized when evicting cached user contexts. Figure 40, for instance, shows the final performance results of 3QM and HBM for the mixed traffic scenario. Latency can almost be halved using a cache size of 1/10 of the total number of users, a result that was not achievable with any of the algorithms that do not take advantage of a priori statistics.
In addition to network statistics, an effective algorithm should also try to capture recency and frequency. The access patterns of memory requests vary between the different traffic scenarios and need to be addressed separately. VoIP users, for example, tend to demonstrate a pattern of frequent intervals, whereas the network statistical weights are particularly good for FTP user scenarios. 3QM was designed to filter between the two and apply different policies to its queues.
The disadvantage of addressing network statistics in the caching algorithm is the time complexity. The metric values are updated every TTI and the complexity of a replacement is thus linear. LRU and Random have much better performance with respect to the number of required clock cycles for each replacement operation. 3QM only looks at the metric values of half of the cached items and therefore outperforms HBM in both hit ratio and complexity.
Although ARC and CAR are state of the art within the area of caching algorithms and indeed perform well, they are not stable enough on this specific input data and therefore do not meet the requirements of a suitable algorithm for the intended purpose. It should also be noted that ARC is patented by IBM.
As discussed, it has been concluded that the obtained cache hit ratio is only guaranteed as an average, not for a single TTI. Overflow from one TTI to the next is therefore possible independently of the total cache hit ratio.
Recommendations
In light of the above discussion, Ericsson is advised to choose among 3QM, Random and LRU depending on the most limiting factor. 3QM performs well across all traffic scenarios and has the highest hit ratio but relatively poor complexity. Random has the lowest overhead and is stable in all scenarios but has a significantly lower hit ratio. LRU is low overhead with average performance (if it is implemented such that moving and accessing an item is constant in time) but suffers from pollution in the VoIP scenario.
Future work
The topic of intelligent caching algorithms, and intelligent algorithms in general, is very promising. My belief after having completed this degree project is that intelligent algorithms will be incorporated into even more areas of society, as the amount of data to process and the number of dimensions to consider are becoming too large for humans to interpret. As mentioned, research is ongoing regarding caching algorithms and media content streaming (Ye, Li, & Chen, 2007), and the current trend in data caching seems to be toward more intelligent solutions customized for the intended environment.
An idea that crossed my mind during the pre-study is that perhaps Hidden Markov Models could be used to predict future requests from the current sequence of requests, if the requests are modeled as chains. It is my belief that some attempts have been made to combine caching algorithms and Markov models, and perhaps a telecom environment is not the best environment to model in this way. This question is, however, left for future studies to determine.
Bibliography
Alghazo, J., Akaaboune, A., & Botros, N. (2004). SF-LRU Cache Replacement Algorithm.
Carbondale: Southern Illinois University at Carbondale.
Bansal, S., & Modha, D. S. (2004). CAR: Clock with Adaptive Replacement. San Francisco, CA:
USENIX Conference on File and Storage Technologies (FAST 04).
Ericsson AB. (2010). Company Facts. Retrieved from www.ericsson.com/thecompany/company_facts on July 14th 2012.
Ericsson AB. (2012). LTE: A Global Success Story. Retrieved from http://www.ericsson.com/res/docs/2012/erix1202_lte_brochure.pdf on July 14th 2012.
Frick, I. (2009). Course material: DD2486 Systemprogrammering. Lecture 4: Minneshierarki,
lokalitet och virtuellt minne. Stockholm: KTH Royal Institute of Technology.
International Telecommunication Union. (2012). About ITU. Retrieved from www.itu.int/net/about/mission.aspx on July 17th 2012.
Jaleel, A., Theobald, K. B., Steely Jr., S. C., & Emer, J. (2010). High Performance Cache
Replacement Using Re-Reference Interval Prediction (RRIP). Saint-Malo, France: ISCA.
Janapsatya, A., Ignjatovic, A., Peddersen, J., & Parameswaran, S. (2010). Dueling CLOCK:
Adaptive Replacement Policy Based on The CLOCK Algorithm. Leuven, Belgium: European
Design and Automation Association.
Jiang, S., Chen, F., & Zhang, X. (2005). CLOCK-Pro: An Effective Improvement of the CLOCK
Replacement. Berkeley: USENIX Association Berkeley.
Lte World. (2009). LTE Advanced. Retrieved from http://lteworld.org/wiki/lte-advanced on March 15th 2012.
Megiddo, N., & Modha, D. S. (2003). ARC: A Self-Tuning, Low Overhead Replacement Cache.
San Francisco, CA: USENIX Conference on File and Storage Technologies (FAST 03).
Paajanen, H. (2007). Page replacement in operating system memory. Jyväskylä: University of
Jyväskylä, Department of Mathematical Information Technology.
Tanenbaum, A. S. (2009). Modern Operating Systems (3rd ed.). Upper Saddle River, NJ: Pearson.
Ye, F., Li, Q., & Chen, E. (2007). An Evolution-Based Cache Scheme for Scalable Mobile Data
Access. Suzhou, China: INFOSCALE.