Intelligent Data Caching in the LTE Telecom Network
E L I N A M E I E R
Master of Science Thesis Stockholm, Sweden 2012
DD221X, Master's Thesis in Computer Science (30 ECTS credits)
Degree Programme in Computer Science and Engineering (300 credits)
Royal Institute of Technology, year 2012
Supervisor at CSC was Alexander Baltatzis
Examiner was Olle Bälter
TRITA-CSC-E 2012:077
ISRN-KTH/CSC/E--12/077--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
Intelligent Data Caching in the LTE Telecom Network
Abstract
This degree project, conducted at Ericsson AB, investigates whether an intelligent caching algorithm can achieve a better cache hit ratio than a general replacement algorithm in the Long Term Evolution (LTE) telecom network environment. The intelligence refers to the use of network statistics available in the operating system. The project includes research on memory management systems, memory access patterns and caching algorithms. A new algorithm, Three-Queue-Metric (3QM), is designed to conform to the characteristics of the access patterns demonstrated by different traffic scenarios. It is empirically shown that 3QM performs better than all other algorithms included in the study on a workload specific to the LTE network. It is therefore concluded that network statistics can indeed be part of a process to improve cache performance.
En statistik-baserad cachingalgoritm för LTE-basstationer

Summary (Sammanfattning)

This degree project, carried out at Ericsson AB, investigates whether an intelligent caching algorithm can improve cache performance in a Long Term Evolution (LTE) network compared to a general replacement algorithm. The intelligence refers to the use of network statistics available in the operating system of a base station. The work also includes a survey of memory management systems, memory access patterns and replacement algorithms. The study leads to the design of a new algorithm, called Three-Queue-Metric (3QM), intended to accommodate all access patterns exhibited by the different traffic scenarios and evaluated algorithms. It is established empirically that 3QM performs better than any of the other implemented algorithms on the specific input data corresponding to an LTE network. It is therefore concluded that network statistics can be used to improve cache performance.
Preface
This degree project is the final task of my Master's in Computer Science, and I would like to thank everyone who has helped and supported me throughout the project. I have had a wonderful time carrying out this intriguing task.

A special thanks to:

Kenneth Hilmersson, Per-Olof Gatter and Arvid Persson, Ericsson, for their invaluable advice, effort and encouragement,
Alexander Baltatzis and Olle Bälter, KTH, for their support, engagement and for
taking me on as a thesis student,
Christian Skärby, Ericsson, for providing me with insight and help with the simulator
tool,
Dharmendra S. Modha, IBM, for supplying me with test data,
John McCarthy, Ericsson, for giving me the opportunity.
Table of contents

Chapter 1
  Introduction
    Background
    Thesis objective
    Delimitations
    Choice of methodology
Chapter 2
  Memory management
    The operating system
    The role of a cache memory
    Memory access patterns
    Bélády's algorithm
  Caching algorithms
    Conventional replacement algorithms
    Contemporary caching algorithms
  Related research fields
Chapter 3
  Implementation
    Preliminary process outline
    Choice of algorithms to implement
    Traffic scenarios and simulated input
    The prototype model
    Code verification
  Results
  Performance analysis
  Resource requirements
  Incorporated network statistics
Chapter 4
  The iterative design of an intelligent data caching algorithm
    A simple heuristic based on metric values
    Performance results of HBM
    Combining HBM with LRU
    Combining HBM with ARC
    Three-Queue-Metric
  Latency calculations
  A comparison to the optimal solution
Chapter 5
  Conclusion
    Discussion
    Recommendations
    Future work
Bibliography
Chapter 1
Chapter 1 serves as an introduction to the degree project where the
background to the problem and overall aim of the thesis report are
presented. It also discusses delimitations and choice of methodology to
accomplish the project goal.
Introduction
This degree project aims to explore the concept of intelligent data caching attained through the use of network statistics available in the operating system of base stations. Caching techniques are a valuable tool for most memory management systems, but this study considers only the telecommunication context. The degree project is carried out at Ericsson AB, a Swedish telecom company with worldwide presence. The company provides operators with network services and supplies the global market with mobile technology (Ericsson AB, 2010).
The amount of user data that a base station manages is too large for the operating system to store in its internal on-chip memory, so a slower secondary memory is required. Through the use of a cache replacement policy, Ericsson foresees that bandwidth savings can be achieved and latency (time delay) thereby reduced. The task of the degree project is to investigate whether network statistics can be utilized for this purpose and to suggest a caching algorithm that performs well on a workload corresponding to that of a Long Term Evolution (LTE) base station.
Background
The thesis concerns LTE technology, a global standard for wireless communication in the telecom industry. The first network was launched in December 2009 by TeliaSonera and Ericsson (Ericsson AB, 2012). LTE is colloquially also referred to as 4G, although the formal performance criterion for 4G of 1 Gbit/s downlink peak data rate for low mobility (Lte World, 2009), established by the International Telecommunication Union (ITU) (International Telecommunication Union, 2012), is not fulfilled for end customers as of today, March 2012. The enhancement of LTE, LTE-Advanced, is Ericsson's response to bridge the gap in performance and increase network capacity.
In an LTE network, each base station has a scheduler that is in charge of scheduling the current users of that site such that the resources are fairly distributed among the users. User data is sent from the base station in time slots on the radio link called Transmission Time Intervals (TTI). A user context holds the information about a specific user that is needed when that user is scheduled to send. The user contexts for all users scheduled in the next TTI are fetched from external memory into internal memory regardless of the previous memory content, which means that no caching technique is applied.
Although it is not performance critical today, this sub-optimal solution becomes an issue as the company has set out to increase network capacity and the number of scheduled users per unit of time. Latency between off-chip and on-chip memory limits the amount of data that can be fetched for use in a single TTI. The goal is to implement a caching algorithm that saves as much bandwidth as possible in the given environment, which means a cache replacement policy with the highest possible cache hit ratio. Cache hit ratio is defined as the number of cache hits divided by the total number of memory requests, as expressed below.
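In formula form:

\[
\text{hit ratio} = \frac{\text{number of cache hits}}{\text{total number of memory requests}}
\]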
One approach is to compare the performance of currently known replacement policies and choose the best suited. However, scheduling decisions depend, for instance, on the amount of data in the user buffers, the current conditions of the radio link and other scheduling parameters. It is not necessarily the most recently used or the most frequently used data that will soon be used again. This means that a conventional cache replacement algorithm that does not accommodate the specific characteristics of a telecom network may not be optimal. A replacement algorithm processing a priori statistics extracted from the network could potentially increase cache efficiency, i.e. the cache hit ratio, compared to the other algorithms.
Thesis objective
This degree project thus intends to investigate whether bandwidth savings between on-chip and off-chip memory can be achieved using a priori statistics when evicting cached data. This is determined, first, by implementing and evaluating carefully selected algorithms and, second, by implementing potentially improved algorithms that process appropriate statistics and comparing all of these against each other.
An application is built in order to handle the input as well as the presentation of the
resulting output. All algorithms and the application are written in C++, a language without a garbage collector, which facilitates explicit control of memory allocation. Post-processing, parsing of data from the simulator, and the graphical presentation of algorithm outputs are handled by Matlab scripts. Matlab is
a tool designed to process and present large amounts of data, which makes it a compelling choice for this project. In addition, the simulator of LTE traffic already provides data in Matlab file format.
The ultimate objective is to be able to suggest an appropriate cache replacement algorithm empirically proven to suit the intended environment of an LTE telecom network. The degree project must also consider and comment on the potential cost in time complexity of the investigated replacement algorithms, both compared to the current solution and relative to each other. Both memory and processor resources are limiting factors in a base station.
The degree project also includes research of memory management systems and
replacement algorithms in order to propose and select relevant algorithms to be further
investigated.
Delimitations
The number and choice of algorithms to implement depends on the results of the pre-study, but also on the time frame of the project. Example traffic data from a simulator is provided by Ericsson and used as input to the algorithms. The sample of relevant statistics to be utilized by the algorithms is selected in consultation with Ericsson.

The evaluation and comparison of caching algorithms is limited to a number of predetermined traffic scenarios, which account for the possibility that the performance of the selected algorithms could vary in different traffic situations. These traffic scenarios are also determined in consultation with Ericsson.
The conclusions of this degree project assume that both the simulated traffic data and
the traffic scenarios provided by Ericsson are realistic and representative of traffic in the
company’s live LTE network.
Choice of methodology
The conclusions and recommendations of this thesis will be based on a scientific study
using an empirical and experimental approach to achieve scientific validity. The
implementation process of the heuristic-based algorithm will be iterative to allow for
continuous improvements. At the first stage of the project a state-of-the-art evaluation is
performed to be able to take advantage of current research in the field of replacement
algorithms. The results are then compared using a quantitative evaluation method, i.e. an assessment process that compares discrete numerical values.
The outputs of the algorithms, which are visualized as graphs, are the calculated hit ratios as a function of cache size n. Finally, latency per TTI is calculated and presented as the percentage of cache misses in one TTI.
Chapter 2
Chapter 2 intends to lay the foundation of the theory that is essential in
order to understand the problem that this degree project aims to explore.
This includes basic theory about memory management systems, cache
memories and a state-of-the-art evaluation of replacement algorithms with
varying degree of complexity.
Memory management
The operating system
An operating system is the layer of software closest to the hardware and runs in kernel mode, as opposed to programs running in user mode. The operating system serves higher-level software with an abstract set of resources and manages access to the hardware on its own. It has the permission to execute any machine instruction (Tanenbaum, 2009). The layered communication is portrayed in Figure 1. Hardware components that the operating system manages include the CPU, memory and I/O devices (Tanenbaum, 2009).
Figure 1: The standard layered structure of communication of a computer architecture.
A process is a key concept enabling several programs to run simultaneously, and the operating system is responsible for scheduling all processes for access to the processor such that each program receives a fair share of the available resources. The constant switching between processes is what gives the appearance of running all processes simultaneously. While a process is scheduled for processor access, it must also be allocated a sufficient amount of memory (Tanenbaum, 2009).
The role of a cache memory
To compromise between the requirements of a memory that is fast, affordable and sufficiently large to store the desired data, the memory system is most often structured according to a layered hierarchy (Frick, 2009). The further away from the CPU, the slower and cheaper the memory becomes. The fastest storage is hence the CPU registers, with essentially no latency but with less than 1 KB of storage. The next layer is usually a high-speed cache, still relatively small in size, and so forth (Frick, 2009). This hierarchy is illustrated in Figure 2.
so forth (Frick, 2009). This hierarchy is illustrated in Figure 2.
Figure 2: A typical memory hierarchy in a memory management system (Frick, 2009).
The idea is that not all information is needed at the same time. Data is fetched on demand from lower levels to higher levels unless it is already located in the higher level. The latter case is called a cache hit and entails significant time savings. Data that will soon be used again should therefore be kept in the cache for better efficiency.
Memory access patterns
A cache replacement algorithm is a policy that decides what to evict when the cache is
full and needs to make room for new data. The suitable replacement algorithm for a
specific implementation depends on the access pattern of the requests of data from
memory. Replacement algorithms are mostly evaluated based on input reflecting
different memory usage patterns, and it has been demonstrated empirically that a set of
the most well-known algorithms perform differently depending on the underlying
pattern of requests (Paajanen, 2007). The evaluation is done by comparing hit ratio as a
function of cache size n (Megiddo & Modha, 2003). Some access patterns indicate for
example, locality in time, such as a loop pattern where the same data is needed in every cycle of the loop.
Figure 3: A definition of two common access patterns that can pollute the cache. A scan is a long
sequence of one-time data requests. Thrashing occurs when there is a loop pattern but the number
of elements in the loop is larger than the cache size (Jaleel, Theobald, Steely Jr., & Emer, 2010).
A loop pattern where the number of items in the loop exceeds the cache size is called a thrashing access pattern (Jaleel, Theobald, Steely Jr., & Emer, 2010), defined in Figure 3a, due to its ability to pollute the cache: any item in the loop is evicted before it is used again. A scan, also called a stream, on the other hand, shows no locality in time but likewise has the ability to pollute the cache. The pattern, defined in Figure 3b (Jaleel, Theobald, Steely Jr., & Emer, 2010), refers to a long sequence of one-time data requests that replace the current and desired content of the cache with data that will not be requested again. A scan-resistant replacement algorithm is one that is resilient to this behavior and does not let the cache get polluted by scanning patterns.
Another example of an access pattern is the correlated pattern (Paajanen, 2007). It
assumes that the same data is requested from memory twice within a short time frame
after which it is not requested again for a long period of time and does not need to
remain in the cache.
Bélády’s algorithm
The optimal algorithm for any access pattern, as defined by Bélády, is one that always evicts the cached data whose next access lies furthest away in time compared to the other content in the cache (Paajanen, 2007). Such an algorithm is only possible if the future data requests are completely predictable. This is most often not the case, if ever, but Bélády's algorithm can be used for reference calculations (Paajanen, 2007). Replacement algorithms can be evaluated by comparing their performance with how Bélády's algorithm would have performed in the same situation. This performance is measured after the algorithms have run, when the data requests that occurred are known.
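The following is a minimal C++ sketch of how such a reference calculation can be simulated offline once the full request trace is known; the naive inner search makes it slow, but it suffices as a sketch. Function and variable names are illustrative and not taken from the thesis implementation.

#include <cstddef>
#include <unordered_set>
#include <vector>

// Offline simulation of Bélády's optimal policy on a known trace.
// On a miss with a full cache, the cached item whose next use lies
// furthest in the future (or never occurs) is evicted. The naive
// next-use search makes this quadratic in the trace length, which
// is acceptable for reference calculations only.
std::size_t simulateBelady(const std::vector<int>& trace, std::size_t capacity) {
    std::unordered_set<int> cache;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        if (cache.count(trace[i])) { ++hits; continue; }
        if (cache.size() == capacity) {
            int victim = 0;
            std::size_t furthest = 0;
            for (int item : cache) {
                std::size_t next = trace.size();       // "never used again"
                for (std::size_t j = i + 1; j < trace.size(); ++j)
                    if (trace[j] == item) { next = j; break; }
                if (next >= furthest) { furthest = next; victim = item; }
            }
            cache.erase(victim);
        }
        cache.insert(trace[i]);
    }
    return hits;
}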
Caching algorithms
Conventional replacement algorithms
Simple queue-based policies
The most commonly recognized replacement algorithm is the Least Recently Used (LRU) algorithm, illustrated in Figure 4. It always evicts the cached data that has not been used for the longest time (Megiddo & Modha, 2003) and performs reasonably well on many different types of workloads.

The Least Frequently Used (LFU) algorithm is also a straightforward policy, with the aim of capturing frequency instead of recency. The policy maintains a frequency count for each cached item and evicts the item with the lowest count. The commonly known First-In-First-Out (FIFO) replacement policy always evicts the oldest data, regardless of when it was last used (Paajanen, 2007). FIFO is not widely implemented as it has been shown to perform significantly worse than LRU for most memory management systems (Paajanen, 2007). The queue structure is, however, easier to implement and has lower overhead than LRU, since there is no constant need to move data around inside the cache. A drawback of both LRU and FIFO is that they are not scan resistant.
Figure 4: An illustration of the LRU replacement policy to the left, and the FIFO replacement
policy to the right. According to the LRU policy an item is moved back to the top of the queue upon
an access request.
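As a minimal sketch of how LRU can be implemented with constant-time operations, the following C++ fragment uses the same structure as the prototype described in Chapter 3: a hash table mapped onto a doubly linked list. Integer keys and the class name are assumptions for illustration.

#include <cstddef>
#include <list>
#include <unordered_map>

// Minimal LRU cache sketch. The hash table maps each key to its node
// in a doubly linked list kept in recency order (most recently used
// at the front), making lookup, update and replace all O(1).
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true on a cache hit; on a miss the LRU item is evicted.
    bool access(int key) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            queue_.splice(queue_.begin(), queue_, it->second); // move to front
            return true;
        }
        if (queue_.size() == capacity_) {
            index_.erase(queue_.back());                       // evict LRU item
            queue_.pop_back();
        }
        queue_.push_front(key);
        index_[key] = queue_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<int> queue_;                                     // MRU front, LRU back
    std::unordered_map<int, std::list<int>::iterator> index_;
};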
Sequence-based
The Sequence-based (SEQ) algorithm was proposed as an alternative to LRU (Paajanen, 2007). Most of the time it functions as an LRU policy, but it keeps track of cache misses in order to detect long sequences of them. If such a sequence is detected, the algorithm switches to a pseudo-MRU (Most Recently Used) policy. The idea is thus to address the poor performance of LRU on scan access patterns while otherwise functioning as the LRU policy normally does (Paajanen, 2007), thereby improving the scan resistance of LRU.
CLOCK
The basic model of the CLOCK algorithm is visible in Figure 5 (Paajanen, 2007). Cached pages are kept in a circular list where the "clock pointer" points at the oldest item in the list, and each page item has a referenced bit. When an item needs to be evicted, the algorithm searches for a page with its referenced bit set to 0. It starts by incrementing the clock pointer; if the current item has its referenced bit set to 1, the bit is reset to 0 and the pointer moves one step in the clock circle, repeating these steps until an item that can be replaced is found. The algorithm has low overhead as it does not move items between different queues but simply maintains the circle and moves the clock pointer. CLOCK is not scan resistant, and it is shown later in this report that it performs similarly to LRU.
Figure 5: The CLOCK model of a cache replacement algorithm where the clock pointer searches
for an item with its referenced bit not set.
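Under the same illustrative assumptions as the LRU sketch above, the victim search of CLOCK can be written as follows; the vector plays the role of the circular list.

#include <cstddef>
#include <vector>

// Sketch of the CLOCK victim search. Each slot holds a key and a
// referenced bit; `hand` is the clock pointer. A set bit is cleared
// (a second chance) and the hand advances until an unset bit is found.
struct ClockSlot { int key; bool referenced; };

std::size_t findClockVictim(std::vector<ClockSlot>& slots, std::size_t& hand) {
    for (;;) {
        ClockSlot& slot = slots[hand];
        if (!slot.referenced) {
            std::size_t victim = hand;
            hand = (hand + 1) % slots.size();   // leave the hand past the victim
            return victim;
        }
        slot.referenced = false;                // second chance
        hand = (hand + 1) % slots.size();
    }
}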
Contemporary caching algorithms
Adaptive Replacement Cache
The Adaptive Replacement Cache algorithm (ARC) is based on the idea of combining two policies in one, self-tuning between the two (Megiddo & Modha, 2003). The algorithm, developed by IBM researchers Modha and Megiddo, is state of the art and acclaimed to perform better than both LRU and LFU. It addresses both recency and frequency while also providing scan resistance. The advantage of the ARC algorithm, compared to other replacement algorithms based on similar ideas, is its low computational overhead, which is independent of cache size (Megiddo & Modha, 2003).
ARC maintains two different lists, L1 and L2. L1 contains pages that have been
referenced only once whereas L2 contains pages that have been referenced at least
twice. The data structure of ARC is depicted in Figure 6.
Figure 6: The implementation structure of ARC where cached items are contents of T1 and T2. B1
and B2 keep references of evicted content to be able to adapt to the current access pattern.
The self-tuning property lies in the number of cached entries taken from each list. If the total size of the cache is c, then both L1 and L2 hold references to c entries each, which means that the total capacity of the combined lists is 2c. The algorithm chooses and adjusts the number of cached entries from each list. When the algorithm detects that most of the cache hits come from, for example, L1, it automatically adapts to the environment by caching more pages from that list and consequently fewer pages from L2 (Megiddo & Modha, 2003).
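The adaptation step can be sketched as follows. The fragment shows only the adjustment of the target size p for T1 on hits in the ghost lists B1 and B2, following the rule in Megiddo and Modha (2003); the surrounding replacement logic is omitted, and the function name is an assumption.

#include <algorithm>
#include <cstddef>

// Sketch of ARC's self-tuning rule (Megiddo & Modha, 2003).
// p is the target number of slots devoted to T1 (recency), c is the
// cache size, and b1/b2 are the current ghost-list sizes. A hit in
// B1 means T1 was too small, so p grows; a hit in B2 shrinks it.
void adaptOnGhostHit(bool hitInB1, std::size_t& p, std::size_t c,
                     std::size_t b1, std::size_t b2) {
    if (hitInB1) {
        std::size_t delta = (b1 >= b2) ? 1 : b2 / b1;
        p = std::min(p + delta, c);
    } else {
        std::size_t delta = (b2 >= b1) ? 1 : b1 / b2;
        p = (p > delta) ? p - delta : 0;
    }
}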
Two Queue
The Two Queue (2Q) algorithm also tries to improve on the qualities of LRU by maintaining two different lists (Paajanen, 2007). One list is implemented as an ordinary LRU list and the other as a FIFO list. The FIFO list is further divided into one Fin list and one Fout list, where the latter contains only reference information. Operations are visualized in Figure 7 (Paajanen, 2007).
Figure 7: The operation policy of the 2Q cache replacement algorithm. Contents of the LRU queue
and Fin are cached whereas Fout keeps references of previous content of Fin.
When a page is first accessed it is inserted at the top of the Fin list. When the Fin list becomes full, the last page in Fin is flushed and a reference to it is added to the Fout list; the page is thus no longer cached but information about it is kept. If a page that is referenced from the Fout list is accessed, it is cached again at the top of the LRU list. If space becomes available in Fin, a cached page is moved there from the end of the LRU list. The purpose of this algorithm is to provide a scan-resistant alternative to the ordinary LRU policy, where pages referenced only once are quickly removed from the cache and only truly "hot" pages are kept in the LRU list. The LRU list is larger in size than Fin, about 3/4 of the total cache size (Paajanen, 2007).
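A compact sketch of the 2Q access path is given below, assuming integer keys. The linear membership tests stand in for the hash-table indexing a real implementation would use, and the handling of space freed in Fin is omitted; both are illustrative simplifications.

#include <algorithm>
#include <cstddef>
#include <deque>

// Sketch of the 2Q access path. fin and lru hold cached keys; fout
// holds references only. Queue fronts are the "top" of each queue.
struct TwoQ {
    std::deque<int> fin, fout, lru;
    std::size_t finCap, foutCap, lruCap;

    void access(int key) {
        if (removeIfPresent(lru, key)) {
            lru.push_front(key);                  // hot page: refresh recency
        } else if (removeIfPresent(fout, key)) {  // re-referenced: promote
            if (lru.size() == lruCap) lru.pop_back();
            lru.push_front(key);
        } else if (std::find(fin.begin(), fin.end(), key) == fin.end()) {
            if (fin.size() == finCap) {           // flush oldest Fin page
                fout.push_front(fin.back());      // keep it as a reference
                if (fout.size() > foutCap) fout.pop_back();
                fin.pop_back();
            }
            fin.push_front(key);                  // first access enters Fin
        }
    }

    static bool removeIfPresent(std::deque<int>& q, int key) {
        auto it = std::find(q.begin(), q.end(), key);
        if (it == q.end()) return false;
        q.erase(it);
        return true;
    }
};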
Second Chance-Frequency Least Recently Used
Second Chance-Frequency LRU (SF-LRU) combines the LFU and LRU policies (Alghazo, Akaaboune, & Botros, 2004). When a page is to be evicted from the cache, an LRU policy is used to select the least recently used page. The next step is to compare this page's frequency value, obtained by LFU calculations, with that of the second least recently used page, as illustrated in Figure 8 (Alghazo, Akaaboune, & Botros, 2004). The page that is evicted is the one of the two that has the lowest frequency value. If a page is saved by its frequency value, this value is reset: the page has been given its second chance and will not be saved again (Alghazo, Akaaboune, & Botros, 2004). The algorithm is not scan resistant.
Figure 8: The operation policy of the SF-LRU cache replacement algorithm. A second chance is
given to the last item in the LRU queue by comparing its frequency value to that of the second-last
item.
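The eviction decision can be sketched in a few lines of C++; the entry fields and the function name are illustrative assumptions.

#include <iterator>
#include <list>

struct Entry { int key; unsigned freq; };

// Sketch of the SF-LRU eviction decision (Alghazo et al., 2004).
// The least recently used entry (at the back) is saved if its
// frequency count beats that of the second least recently used
// entry; its count is then reset so it is not saved twice in a
// row. Requires at least two entries in the queue.
int evictSfLru(std::list<Entry>& queue) {
    auto last = std::prev(queue.end());         // least recently used
    auto secondLast = std::prev(last);
    if (last->freq > secondLast->freq) {
        int victim = secondLast->key;
        last->freq = 0;                         // second chance granted once
        queue.erase(secondLast);
        return victim;
    }
    int victim = last->key;
    queue.erase(last);
    return victim;
}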
CLOCK-Pro
The CLOCK-Pro algorithm implements the CLOCK model but with a different approach to marking which pages are candidates for eviction (Jiang, Chen, & Zhang, 2005). CLOCK-Pro calculates the "reuse distance" for each page, which is the number of times any other page has been accessed since this specific page was last accessed. This distance is used to determine if the page is "hot" or "cold". Once a page
is brought into the cache or degraded from hot to cold, it is marked as cold and given a test period in which to prove its importance (Jiang, Chen, & Zhang, 2005).
Figure 9: The image from Jiang, Chen and Zhang (2005) shows the CLOCK-Pro cache replacement model. Hot pages are marked with "H" and cold pages with "C". The check marks represent reference bits set to 1.
Cold pages are eligible for eviction by the clock pointer, but if a cold page is evicted during its test period, the reference data is kept in the clock until the end of the period. If a cold page is accessed during its test period, it is re-marked as hot; if a test period ends without any memory access, the page becomes eligible for eviction without a new test period.
To maintain these markings and keep them up to date, the CLOCK-Pro algorithm includes three different clock pointers in the clock circle: the test, hot and cold pointers, which all have different marking and incrementing rules. The modified CLOCK-Pro model is portrayed in Figure 9 (Jiang, Chen, & Zhang, 2005).
Dueling CLOCK
The Dueling CLOCK (DC) policy was designed to alternate between a CLOCK model and a scan-resistant version of the CLOCK model (Janapsatya, Ignjatovic, Peddersen, & Parameswaran, 2010). To add scan resistance to the CLOCK model, the "clock pointer" is not incremented before searching for a page with its referenced bit set to 0, which means that the most recently replaced data is also eligible for eviction. The DC algorithm self-tunes, adapting dynamically to the current behavior by alternating between these two models (Janapsatya, Ignjatovic, Peddersen, & Parameswaran, 2010).
Figure 10: The two CLOCK models that the CAR algorithm maintains. The image is from Bansal and Modha (2004). Additional characteristics of the algorithm are very similar to ARC.
CLOCK with Adaptive Replacement
CLOCK with Adaptive Replacement (CAR) is also a state-of-the-art replacement policy, inspired by ARC (Bansal & Modha, 2004). CAR maintains two "clocks", shown in Figure 10: T1, which captures recency as in the original model, and T2, which instead captures frequency but otherwise functions the same. Similar to ARC, two lists B1 and B2 are introduced to maintain the history information of the previous lists. B1 and B2 do not contain data content, only reference information, and these lists also have a replacement policy of their own. The algorithm uses B1 and B2 to dynamically adapt the sizes of T1 and T2. The implementation is therefore scan resistant, low overhead and self-tuning, again similarly to ARC (Bansal & Modha, 2004).
CAR with Temporal filtering (CART) is an additional enhancement of the CAR algorithm, with the objective of being resistant to pollution from the correlated access pattern. The temporal filter is added as a stricter rule for a page to advance from the recency clock to the frequency clock: two memory accesses within a short period of time are not enough for a page to be considered frequent. CART therefore has all of the advantages of the CAR algorithm but with an extra improvement that neither CAR nor ARC incorporates (Bansal & Modha, 2004).
Related research fields
Research on caching techniques for traditional memory management systems, with a
memory hierarchy analogous to Figure 2, has been ongoing for many years. However,
recent advancements in mobile technology have boosted demand for research in new
areas of application. Ye, Li and Chen, for instance, state that "Recently, multimedia data caching is getting more attention" (Ye, Li, & Chen, 2007). The environment refers to data caching between streaming servers and base stations, which becomes necessary as people consume and stream ever more media content through their mobile devices. Data traffic in mobile networks has grown rapidly over the past decade, and one can therefore presume that an even more pressing demand for innovative caching solutions will arise in the future.
Chapter 3
Chapter 3 accounts for the first implementation phase of the degree project. The development process, the prototype model and the obtained performance results are described. The implemented algorithms are validated, and network statistics are chosen to suggest how statistics can be incorporated in a caching policy.
Implementation
Preliminary process outline
The implementation process model that the degree project follows is outlined in Figure 11. The iterative development process is highlighted and allows for continuous improvements and evaluation of new ideas. Once this process is completed, the final results are compared to Bélády's algorithm to see how they differ from the optimal solution.
Figure 11: An iterative implementation process model of the project task.
Choice of algorithms to implement
The selection of algorithms to evaluate in the first step of the process can be seen in Figure 12. They were chosen as a result of the pre-study, either to facilitate the detection of access patterns or to evaluate current research in the field.
Figure 12: Algorithms chosen for the empirical study.
Traffic scenarios and simulated input
The evaluation of the implemented replacement algorithms is based on the four
different traffic scenarios, or user types, seen in Figure 13. FTP stands for File Transfer
Protocol. The scenarios are simulated for a base station with three cells that hold a large
number of simultaneous users. A cell with only a few simultaneous users is not relevant
to this project as latency between on-chip and off-chip memory is presumed to be
performance critical only for a large number of simultaneous users. The mixed scenario
that includes all different user types is most important for the final performance
evaluation as it emulates a realistic scenario. The proportion of different user types in
the mixed scenario is 60% Web browsing, 20% VoIP and 20% FTP download.
Figure 13: User types in a telecom network that are simulated as different traffic scenarios.
The prototype model
The application is implemented according to the model seen in Figure 14. The main application processes the requests in the TTIs according to the chosen cache strategies, which can be added or removed on demand. A hash table is also mapped onto the doubly linked list such that each access to an item is constant in time.
Figure 14: Basic model of the implemented application and its interaction with the algorithms.
Code verification
The final implementation code for all algorithms is reviewed by Ericsson personnel in order to ensure that the algorithms are correctly implemented according to their respective pseudocode.
In addition, LRU, ARC, CLOCK and CAR are verified against the test data workloads provided by IBM (Megiddo & Modha, 2003). These workloads are further explained in the research paper ARC: A Self-Tuning, Low Overhead Replacement Cache (Megiddo & Modha, 2003). The tested workloads are P3, P4, P5, P6 and P8. Figure 15 and Figure 16 are examples of the compared graphs for workload P6.

The correct curves are obtained for all four algorithms, and single data points in the graphs differ in value by at most 0.4 percentage points. The implemented algorithms and the application can therefore be considered validated, with no significant logical or practical error, according to the requirements of both KTH and Ericsson. Figure 15 and Figure 16 also emphasize the similarities in performance of ARC and CAR as well as of LRU and CLOCK.
Figure 15: Code verification. Hit ratio results on workload P6 of the implemented versions of CAR
and CLOCK are shown to the left. The results correspond with the official graph of (Bansal &
Modha, 2004) to the right.
Figure 16: Code verification. Hit ratio results on workload P6 of the implemented versions of ARC
and LRU are shown to the left. The results correspond with the official graphs of (Megiddo &
Modha, 2003) to the right.
Results
The output graphs for all scenarios are presented in Figures 17-24. The graphs of the seven algorithms are divided into two groups for convenience.

Please note that algorithms that tend to perform similarly also produce similar graphs, and these curves can be difficult to separate from one another in the figures below. The main intention in presenting the graphs, however, is to show which curves do or do not differ from each other, not to convey exact numerical values.
Please also note that the graph of LFU in Figure 17 does not approach a hit ratio of 100%, which is correct behavior. A disadvantage of LFU is that formerly frequent items that now should be evicted have a much higher frequency count than newly added items and therefore remain cached unnecessarily long. In the VoIP scenario new users are added at a higher rate than in the other scenarios, so LFU suffers from this disadvantage, which is why the graph does not reach 100%. The number of active users in the system remains constant, but new users are added as old users leave.
Figure 17: Random, FIFO and LFU hit ratios for the VoIP traffic scenario. FIFO suffers from
thrashing for small cache sizes.
Figure 18: LRU, ARC, CLOCK and CAR hit ratios for the VoIP traffic scenario. LRU and
CLOCK suffer from thrashing for small cache sizes. ARC and CAR demonstrate unstable
behavior.
Figure 19: Random, FIFO and LFU hit ratios for the Web traffic scenario.
Figure 20: LRU, ARC, CLOCK and CAR hit ratios for the Web traffic scenario.
Figure 21: Random, FIFO and LFU hit ratios for the FTP traffic scenario.
Figure 22: LRU, ARC, CLOCK and CAR hit ratios for the FTP traffic scenario.
Figure 23: Random, FIFO and LFU hit ratios for the Mix traffic scenario.
Figure 24: LRU, ARC, CLOCK and CAR hit ratios for the Mix traffic scenario.
Performance analysis
The results indeed imply that the access patterns vary between the different scenarios. It is important to find an algorithm that performs well for all scenarios, especially the Mix, as all user types are present in the telecom network.

It is evident that ARC and CAR show similar and mostly good performance but lack stability. Figure 18, for example, indicates that the algorithms occasionally suffer from thrashing on the specific workload, but only for some cache sizes. This behavior is not desired in the intended environment. It is essential that the algorithm behavior is somewhat predictable, in the sense that performance should not worsen with a larger cache size.
The graphs also show that LRU, CLOCK and FIFO suffer from thrashing for smaller cache sizes on the VoIP pattern. In general, the VoIP traffic scenario seems to be more difficult for a caching policy to manage, whereas Web, for instance, is much easier.

The analysis concludes that, at this point, Random would be the best choice of an algorithm that is stable and relatively good for all scenarios. CLOCK and LRU are also good alternatives despite their poor performance on VoIP. LFU demonstrates bad performance on all scenarios except VoIP, where LRU is much worse, which stresses that a well-customized algorithm should in fact address both recency and frequency to accommodate all access patterns. As mentioned above, ARC and CAR do demonstrate good performance, but their instability is not acceptable for the intended purpose.
Resource requirements
Time complexity of the algorithms will depend slightly on design choices of the
implementations. It is not certain that these specific implementations of the algorithms
are applicable to the environment of the operating system in the base stations. Ericsson
is therefore advised to also, if relevant, analyze the algorithms with respect to the
possibilities and limitations of their own software and hardware.
Note that it is not critical that the prototype implementation is optimal, as the aim is to evaluate the output of the algorithms, not to measure running time. Nonetheless, in this implementation most algorithms use a hash table mapped onto a doubly linked list, such that checking for a cache hit and accessing an item is constant in time, at the cost of the additional hash table data structure. Time complexities and memory requirements for all seven algorithms are found in Table 1.
Algorithm   lookup   update      replace     Memory requirement
LRU         O(1)     O(1)        O(1)        c
LFU         O(1)     O(n log n)  O(n log n)  c
CLOCK       O(1)     O(1)        O(n)        c
ARC         O(1)     O(1)        O(1)        2c
CAR         O(1)     O(1)        O(n)        2c
FIFO        O(1)     O(1)        O(1)        c
Random      O(1)     O(1)        O(n)        c

Table 1: Time complexities of the three main methods (lookup, update, replace) and memory requirements for cache size c.
CLOCK and CAR have a worst-case time complexity for replace of O(n) due to the clock search for an unset reference bit. Random has the same complexity for replace as CLOCK and CAR because it iterates to the randomly generated position in the list. LFU has a theoretical update and replace time complexity of O(log n) due to heap traversal, but the current implementation's conversion between iterators and positions leads to an actual complexity of O(n log n).
ARC, on the other hand, is constant with respect to cache size, but it must be emphasized that it requires more clock cycles than the other algorithms, as more operations need to be performed. ARC and CAR, like 2Q and CLOCK-Pro, also require more memory due to keeping track of history information. These are clear disadvantages of ARC and CAR, as both memory and processing resources in the base stations are limited.
Incorporated network statistics
The algorithms that incorporate network statistics employ so-called metric values for each user, which are updated every TTI. It is not disclosed in this thesis report how the metric values are calculated or on which scheduling parameters they depend; they combine various aspects such as the user type, reuse distance and other radio conditions. These values can, however, be regarded as weights that are hypothetically assumed to increase as the probability of being scheduled increases. The degree project determines whether this hypothetical relation is supported or not by the results of the algorithms.
Chapter 4
Chapter 4 describes the second implementation phase, in which the replacement algorithms utilize network statistics with the intention of customizing for the telecom environment. A new algorithm named Three-Queue-Metric is designed. Results are compared to some of the previous algorithms and ultimately to the optimal solution.
The iterative design of an intelligent data
caching algorithm
A simple heuristic based on metric values
The first algorithm to be evaluated uses the heuristic of evicting the cached user context with the lowest probability of being scheduled according to the hypothesis, which states that the user context with the lowest metric value has the lowest probability of being scheduled. If more than one context shares the same metric value, one is chosen at random. The algorithm will from here on be called the Heuristic-Based Metric Algorithm (HBM). The eviction step is sketched below.
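A minimal C++ sketch of the HBM eviction step follows, assuming each cached user context carries its current metric value; the field and function names are illustrative, and ties are broken uniformly at random as described above.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Sketch of the HBM eviction heuristic: scan all cached user
// contexts and pick the one with the lowest metric value, breaking
// ties uniformly at random. The full scan makes every replacement
// linear in the number of cached items.
struct CachedContext { int userId; double metric; };

std::size_t selectHbmVictim(const std::vector<CachedContext>& cache) {
    std::size_t victim = 0;
    std::size_t ties = 1;
    for (std::size_t i = 1; i < cache.size(); ++i) {
        if (cache[i].metric < cache[victim].metric) {
            victim = i;
            ties = 1;
        } else if (cache[i].metric == cache[victim].metric) {
            ++ties;                                  // reservoir sampling keeps
            if (std::rand() % ties == 0) victim = i; // each tied item with
        }                                            // probability 1/ties
    }
    return victim;
}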
Results for the four traffic scenarios are shown in Figures 25-28. It is clear that HBM is stable and performs well on workloads corresponding to all scenarios, although it is not the best one in every scenario. It follows the same curve shape as Random, but with improved performance, and it can hence be deduced that the metric values do correlate with the probability of future scheduling decisions; otherwise HBM would be no better than pure randomness. This leaves room for possible improvements. As discussed previously, a well-suited algorithm should also attempt to capture both recency and frequency for better performance, which will have to be incorporated. Another disadvantage is the time complexity: every replace operation traverses all cached items and is therefore linear, not only in the worst case.
Analysis of the results thus concludes that although the algorithm demonstrates good
stability and performance, an algorithm with better complexity and/or improved
performance on certain workloads would be preferable. This can perhaps be achieved
by addressing recency and/or frequency better. The next section describes several
unsuccessful attempts that have been made to achieve better results. Although not very
successful, they did lead to valuable reflections that spurred the design of a more
successful attempt that will be discussed later.
From the previous analysis it is assumed that ARC and CAR have similar characteristics, as do CLOCK and LRU. Therefore only ARC and LRU are considered in the following investigation, with the aim of improving HBM with respect to hit ratio and the number of required clock cycles. Random and FIFO are disregarded because HBM already performs better.
Performance results of HBM
Figure 25: Performance of HBM for the VoIP traffic scenario.
Figure 26: Performance of HBM for the Web traffic scenario.
Figure 27: Performance of HBM for the FTP traffic scenario.
Figure 28: Performance of HBM for the Mix traffic scenario.
Combining HBM with LRU
Given the above analysis, it seems plausible that both performance and stability are achievable by incorporating a metric-based heuristic into ARC or LRU. This section concerns LRU. Two new algorithms named LRU-M and M-LRU are designed to combine LRU and HBM; they intend to capture the predictability of metric values as well as recency. The logical models of these algorithms are shown in Figure 29.
Figure 29: LRU-M is illustrated to the left and M-LRU to the right. LRU-M evicts the item with the least metric value from a 10 percent selection of the least-recently-used items.
LRU-M is designed to reduce the number of cached items that have to be inspected for each replacement operation. M-LRU, on the other hand, is designed to test whether the performance of HBM could improve by applying the LRU approach to items sharing the same lowest metric value. Unfortunately, neither LRU-M nor M-LRU results in better overall performance than HBM; results are visible in Figure 31. Some graphs from here on in the report have been omitted due to redundancy or lack of contribution to the analysis.
Figure 30: ARC-M operates similarly to ARC except for the decision policy in T1 and T2.
Figure 31: Performance of LRU-M and M-LRU for the Mix traffic scenario. Neither one shows
better performance than HBM.
Combining HBM with ARC
To combine the advantages of both HBM and ARC, ARC-M is designed according to the model in Figure 30. T1 and T2 are HBM queues instead of LRU queues whenever the ARC policy chooses to replace an item in either one of them; the task of T2 is still to capture frequency.

Furthermore, ARC-MT1 is designed to determine whether the metric-based heuristic is preferable over an LRU queue for T1 alone. The model is visualized in Figure 32: T2 remains an LRU queue whereas T1 is implemented as an HBM queue.
The hit ratios of ARC-M and ARC-MT1, depicted in Figure 33 and Figure 34, show that ARC-MT1, ARC-M and ARC all share similar performance. This means that both newly designed algorithms must also be considered unstable. Previous conclusions suggest that recency and frequency should be addressed, but the new results indicate that the adaptive characteristics also entail instability. This conclusion prompted the attempt to design an algorithm that addresses frequency, recency and metric values separately, in queues of fixed sizes, with the hypothesis that the sizes perhaps do not need to be adaptive.
Figure 32: ARC-MT1 applies a metric-based queue only to T1 whereas T2 remains an LRU queue.
Three-Queue-Metric
Three-Queue-Metric (3QM) is first designed to consist of two queues: one metric-based queue and one LRU queue. ARC is self-adaptive and keeps track of history information; the goal here is instead a more stable algorithm on the specific access patterns, which ARC and its variants proved not to be, so 3QM does not incorporate any adaptive characteristics. As a consequence, the sizes of the queues are constant and have to be tuned.

The problem with this two-queue structure is that items from the HBM queue are added to the LRU queue upon access and are obligated to remain there until they reach the LRU position. This means that the separation between the queues does not function properly, and the results are unsatisfactory.
Inspired by CART, which uses a temporal filter to separate the two clock models (Bansal & Modha, 2004), a filter queue is added between the two queues with the purpose of passing only the truly frequent items on to the LRU queue and sending the other items back to the HBM queue. The operations of 3QM are found in Figure 35. The filter queue is also an LRU queue, and any item in it has to be returned to the HBM queue before it can be considered for eviction. At that stage the new, and therefore more accurate, metric value is used, so the metric value the item had while in the filter does not matter. In order for the filter to fulfill its purpose it is necessary that its size is relatively small. Different sizes of the three queues for a total cache size c were tested, and a good division of the cache size was found to be 50-20-30 percent for the HBM, filter and LRU queues respectively.
Figure 33: Performance of ARC-M and ARC-MT1 for the VoIP traffic scenario.
Figure 34: Performance of ARC-M and ARC-MT1 for the Mix traffic scenario.
Chapter 4
34
Figure 35: Structure and filter operations of the 3QM cache replacement algorithm. An access request in the HBM queue moves the item to the filter queue and, if accessed again, to the LRU queue.
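To make the queue interactions concrete, the following is a speculative C++ sketch of the 3QM promotion path, based only on the description above and Figure 35; the thesis implementation is not fully disclosed, so details such as how the LRU queue demotes items are labeled as assumptions in the comments.

#include <algorithm>
#include <cstddef>
#include <deque>

// Speculative sketch of the 3QM queue movements. Only the promotion
// path is shown; the metric-based victim scan in the HBM queue is
// the one sketched for HBM above. Queue fronts are the "tops".
struct ThreeQM {
    std::deque<int> hbm, filter, lru;
    std::size_t filterCap, lruCap;     // e.g. 20% and 30% of cache size c

    void onAccess(int key) {
        if (removeIfPresent(hbm, key)) {       // re-used once: into the filter
            if (filter.size() == filterCap) {
                hbm.push_front(filter.back()); // filter overflow returns to HBM
                filter.pop_back();             // rather than leaving the cache
            }
            filter.push_front(key);
        } else if (removeIfPresent(filter, key)) { // re-used again: truly hot
            if (lru.size() == lruCap) {
                hbm.push_front(lru.back());    // assumption: demote, not evict
                lru.pop_back();
            }
            lru.push_front(key);
        } else if (removeIfPresent(lru, key)) {
            lru.push_front(key);               // ordinary LRU refresh
        }
    }

    static bool removeIfPresent(std::deque<int>& q, int key) {
        auto it = std::find(q.begin(), q.end(), key);
        if (it == q.end()) return false;
        q.erase(it);
        return true;
    }
};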
The theoretical proof of an optimal partitioning is beyond the scope of this degree
project. It is my belief, however, that the sizes depend on the proportion of different
user types in the network. As discussed, Figures 25-27 indicate that VoIP exhibits
locality in frequency intervals whereas FTP and Web exhibit locality of time and metric
values. The intention of 3QM is to separate these different users according to Figure 36
such that VoIP users are moved to the LRU queue while the other users remain in the
HBM queue.
Figure 36: Logical idea of the 3QM cache replacement algorithm. Frequency is addressed by allowing frequent items to be inserted into the LRU queue to the right.
The result of 3QM compared to HBM is presented in Figures 37-40. 3QM performs better than HBM in almost all areas of the graphs while demonstrating better stability than ARC and CAR. As intended, the best increase in performance compared to HBM is for VoIP users. Moreover, although the algorithm has the same theoretical, linear time complexity as HBM, it only has to traverse the cached items of the HBM queue, half of the cache, for every replacement operation and therefore halves the number of required operations. Additionally, there is no need for extra history information because the sizes are not adapted online, which minimizes memory consumption; instead, the queue sizes have been tuned offline to conform to the intended environment.
Figure 37: Performance of 3QM for the VoIP traffic scenario. 3QM performs better than HBM and
LRU.
Figure 38: Performance of 3QM for the Web traffic scenario.
Figure 39: Performance of 3QM for the FTP traffic scenario.
Figure 40: Performance of 3QM for the Mix traffic scenario.
3QM is scan resistant as one-time request items do not advance to the next queues and
are therefore eligible for eviction immediately after they have been added to the cache.
Latency calculations
The total latency for each caching algorithm and cache size n is proportional to the cache miss ratio, and it therefore decreases as the hit ratio increases. The graphs visible in Figure 41, on the other hand, show the miss ratio per TTI. It should not be assumed that a total cache hit ratio of 30 percent entails a cache hit ratio of 30 percent for each TTI. For convenience, only every 400th TTI is marked on the graphs, and the graphs do tend to fluctuate over a considerably large range of values. It is therefore emphasized that there is no guarantee that there will not be any overflow from one TTI to the next, but rather that the total latency is minimized to a certain value.
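For clarity, the quantity plotted in Figure 41 can be computed per TTI as below; the function name is illustrative.

#include <cstddef>

// Miss ratio for a single TTI: misses among the user contexts
// requested in that TTI divided by the number of requests. An
// average hit ratio of 30% says nothing about any individual TTI.
double missRatioPerTti(std::size_t missesInTti, std::size_t requestsInTti) {
    return requestsInTti == 0
               ? 0.0
               : static_cast<double>(missesInTti) / requestsInTti;
}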
Figure 41: Final latency calculations for the Mix traffic scenario shown as miss ratio per TTI. The
fluctuations emphasize the variation of hit ratio from one TTI to another. Hit ratio results are only
a measure of average performance.
A comparison to the optimal solution
Figure 42 shows the performance of 3QM in comparison to Bélády’s algorithm. As
discussed previously, the optimal caching algorithm is not realistically achievable if the
sequence of requests is unknown. Therefore 3QM can be considered to perform
relatively well despite the gap between the two graphs.
Figure 42: A comparison of the 3QM cache replacement algorithm to Bélády’s optimal algorithm.
Chapter 5
Chapter 5 is the last chapter and concludes the degree project with a
discussion of the results obtained throughout the course of the study and
recommendations for Ericsson.
Conclusion
Discussion
This degree project empirically shows that network statistics in the operating system of a base station can be utilized when evicting cached user contexts. Figure 40, for instance, shows the final performance results of 3QM and HBM for the mixed traffic scenario. Latency can almost be halved using a cache size of 1/10 of the total number of users, a result that was not achievable with any of the algorithms that do not take advantage of a priori statistics.
In addition to network statistics, an effective algorithm should also try to capture recency and frequency. The access patterns of memory requests vary between the different traffic scenarios and need to be addressed separately. VoIP users, for example, tend to demonstrate a pattern of frequent intervals, whereas the network statistical weights are particularly good for FTP user scenarios. 3QM was designed to filter between the two and apply different policies to its queues.
The disadvantage of addressing network statistics in the caching algorithm is the time complexity. The metric values are updated every TTI and the complexity of a replacement is thus linear. LRU and Random have much better performance with respect to the number of required clock cycles for each replacement operation. 3QM only looks at the metric values of half of the cached items and therefore outperforms HBM in both hit ratio and complexity.
Although ARC and CAR are state of the art within the area of caching algorithms and indeed perform well, they are not stable enough on this specific input data and therefore do not meet the requirements of a suitable algorithm for the intended purpose. It should also be noted that ARC is patented by IBM.
As discussed, it has been concluded that the obtained cache hit ratio is only guaranteed as an average, not for a single TTI. Overflow from one TTI to the next is therefore possible independently of the total cache hit ratio.
Recommendations
In light of the above discussion, Ericsson is advised to choose among 3QM, Random and LRU depending on the most limiting factor. 3QM performs well across all traffic scenarios and has the highest hit ratio but relatively poor complexity. Random has the lowest overhead and is stable in all scenarios but has a significantly lower hit ratio. LRU is low overhead with average performance (if it is implemented such that moving and accessing an item is constant in time) but suffers from pollution in the VoIP scenario.
Future work
The topic of intelligent caching algorithms, and intelligent algorithms in general, is very promising. My belief after having completed this degree project is that intelligent algorithms will be incorporated into even more areas of society, as the amount of data to process and the number of dimensions to consider are becoming too large for humans to interpret. As mentioned, research is ongoing regarding caching algorithms and media content streaming (Ye, Li, & Chen, 2007), and the current trend in data caching seems to be toward more intelligent solutions customized for the intended environment.
An idea that crossed my mind during the pre-study is that perhaps Hidden Markov Models could be used to predict future requests from the current sequence of requests, if the requests are modeled as chains. It is my belief that some attempts have been made to combine caching algorithms and Markov models, and perhaps a telecom environment is not the best environment to model in this way. This question is, however, left for future studies to determine.
Bibliography
Alghazo, J., Akaaboune, A., & Botros, N. (2004). SF-LRU Cache Replacement Algorithm.
Carbondale: Southern Illinois University at Carbondale.
Bansal, S., & Modha, D. S. (2004). CAR: Clock with Adaptive Replacement. San Francisco, CA:
USENIX Conference on File and Storage Technologies (FAST 04).
Ericsson AB. (2010). Company Facts. Retrieved from www.ericsson.com/thecompany/company_facts on July 14th 2012.
Ericsson AB. (2012). LTE: A Global Success Story. Retrieved from http://www.ericsson.com/res/docs/2012/erix1202_lte_brochure.pdf on July 14th 2012.
Frick, I. (2009). Course material: DD2486 Systemprogrammering. Lecture 4: Minneshierarki,
lokalitet och virtuellt minne. Stockholm: KTH Royal Institute of Technology.
International Telecommunication Union. (2012). About ITU. Retrieved from www.itu.int/net/about/mission.aspx on July 17th 2012.
Jaleel, A., Theobald, K. B., Steely Jr., S. C., & Emer, J. (2010). High Performance Cache
Replacement Using Re-Reference Interval Prediction (RRIP). Saint-Malo, France: ISCA.
Janapsatya, A., Ignjatovic, A., Peddersen, J., & Parameswaran, S. (2010). Dueling CLOCK:
Adaptive Replacement Policy Based on The CLOCK Algorithm. Leuven, Belgium: European
Design and Automation Association.
Jiang, S., Chen, F., & Zhang, X. (2005). CLOCK-Pro: An Effective Improvement of the CLOCK
Replacement. Berkeley: USENIX Association Berkeley.
Lte World. (2009). LTE Advanced. Retrieved from http://lteworld.org/wiki/lte-advanced on March 15th 2012.
Megiddo, N., & Modha, D. S. (2003). ARC: A Self-Tuning, Low Overhead Replacement Cache.
San Francisco, CA: USENIX Conference on File and Storage Technologies (FAST 03).
Paajanen, H. (2007). Page replacement in operating system memory. Jyväskylä: University of
Jyväskylä, Department of Mathematical Information Technology.
Tanenbaum, A. S. (2009). Modern Operating Systems (3rd ed.). Upper Saddle River, NJ: Pearson.
Ye, F., Li, Q., & Chen, E. (2007). An Evolution-Based Cache Scheme for Scalable Mobile Data
Access. Suzhou, China: INFOSCALE.