

The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms

by Ali R. Butt, Chris Gniady, and Y. Charlie Hu, SIGMETRICS'05

Course: CSCI 780 – Advanced Topics on Caching Techniques in Computer and Distributed Systems

Presenter: Chuan Yue


Outline

• The Buffer Cache

• Linux Kernel Prefetching

• Adapted Buffer Cache Replacement Algorithms

• Simulation Results

• Conclusions

• Discussions


Buffer Cache in Main Memory

• Two kinds of I/O operations:
  – Direct-access read()/write() uses the block-based buffer cache
  – Memory-mapped I/O shares the page cache with the virtual memory system

• This naturally leads to two separate buffers (see the sketch below)

• Problems:
  – Double buffering
  – Inconsistencies between the two buffers

[Figure: read()/write() I/O goes through the buffer cache; the virtual memory system and memory-mapped I/O go through the page cache; both caches sit above the disk.]
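To make the two paths concrete, below is a minimal Python sketch contrasting them; the file name and contents are illustrative. On a system with separate caches, the read() call is served through the block-based buffer cache, while the mapped access is served through the page cache.

    # Two ways to read the same bytes: direct read() vs. memory-mapped I/O.
    import mmap
    import os

    path = "example.dat"                     # illustrative test file
    with open(path, "wb") as f:
        f.write(b"hello buffer cache\n")

    fd = os.open(path, os.O_RDONLY)
    data_read = os.read(fd, 64)              # path 1: direct access read()

    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
        data_mapped = mm[:]                  # path 2: memory-mapped access

    os.close(fd)
    assert data_read == data_mapped          # same bytes, two caching paths
    os.remove(path)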


Unification of Buffer Cache and Page Cache

• A unified buffer cache uses the same page cache to store:
  – Virtual memory pages
  – Memory-mapped pages
  – Ordinary file system I/O

• Issues:
  – Complex interactions between the file system and VM

[Figure: read()/write() I/O, the virtual memory system, and memory-mapped I/O all go through a single unified buffer cache above the disk.]


Buffer Cache Management

• Designing effective buffer cache replacement algorithms is a fundamental challenge in improving system performance
  – Traditional file I/O system
  – Virtual memory system

• Various buffer cache replacement algorithms
  – LRU replacement is widely used
  – LRU is unable to cope with access patterns that have weak locality
  – Other well-known algorithms that utilize recency information: LRU-2, 2Q, LIRS, LRFU, MQ, ARC


Prefetching

• Prefetching is another highly effective technique for improving I/O performance

• The main motivation for prefetching is to overlap computation with I/O and thus reduce the exposed latency of I/O

• Various prefetching techniques:
  – Prefetching using user-inserted hints about I/O access patterns (see the sketch below)
    • Drawback: places a burden on the programmer
  – File-system kernel-driven prefetching in modern operating systems
    • Synchronous read-ahead to amortize seek cost
    • Asynchronous prefetching after detecting sequential access patterns
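As an aside on the hint-based approach: POSIX exposes such hints through fadvise, available in Python as os.posix_fadvise (Unix-only, Python 3.3+). A minimal sketch; the file name and sizes are illustrative.

    # Hint the kernel about a sequential scan, then ask it to prefetch 1 MiB.
    import os

    path = "input.dat"
    with open(path, "wb") as f:
        f.write(os.urandom(1 << 20))         # 1 MiB of stand-in data

    fd = os.open(path, os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)      # whole-file hint
    os.posix_fadvise(fd, 0, 1 << 20, os.POSIX_FADV_WILLNEED)  # prefetch now

    while chunk := os.read(fd, 1 << 16):     # consume the file in 64 KiB reads
        pass                                 # processing would go here
    os.close(fd)
    os.remove(path)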


The impact of kernel prefetching on buffer cache replacement algorithms’ performance

• The close interactions between caching and prefetching:
  – Prefetching file blocks into the cache can be harmful (P. Cao, et al., 1995)
  – Both the replacement policy and prefetching → buffer cache hit ratio
  – Hit ratio, prefetching & clustering → I/O disk traffic
  – I/O disk traffic → file system performance

• Almost all proposed buffer cache replacement algorithms have not taken kernel-driven prefetching into account

• The work in this paper:
  – Shows the potential performance impact of kernel prefetching on buffer cache replacement algorithms
  – Presents simulation results for 8 adapted replacement algorithms


Kernel components on the path from file system operations to the disk


Kernel Prefetching in Linux

• Prefetching is based on the pattern of accesses to the file
  – Only read accesses are considered for prefetching
  – Beneficial for sequential accesses to a file

• Read-ahead Group and Read-ahead Window

• Synchronous Prefetching and Asynchronous Prefetching

[Figure: blocks 1–10 of a file; as sequential accesses proceed, the read-ahead group advances along the file, and the read-ahead window spans the current and the previous group.]
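Below is a toy Python model of this group/window bookkeeping; it is not the kernel's actual code, and the initial group size, the doubling policy, and the restart heuristic are illustrative assumptions.

    # Toy read-ahead state: the *group* is the most recently prefetched run of
    # blocks, and the *window* is the current group plus the previous one.
    class ReadAhead:
        MAX_GROUP = 32                       # assumed cap on group size

        def __init__(self):
            self.prev_group = set()          # previously prefetched blocks
            self.group = set()               # most recently prefetched blocks
            self.size = 4                    # assumed initial group size
            self.next_block = None           # expected next sequential block

        def window(self):
            return self.prev_group | self.group

        def access(self, block):
            """Return the set of blocks to prefetch for this demand access."""
            if block != self.next_block:     # non-sequential: restart
                self.prev_group, self.group = set(), set()
                self.size = 4
                prefetch = set(range(block, block + self.size))  # synchronous
            elif block in self.group:        # sequential hit in current group
                self.size = min(self.size * 2, self.MAX_GROUP)
                start = max(self.window()) + 1
                prefetch = set(range(start, start + self.size))  # asynchronous
            else:
                prefetch = set()
            if prefetch:
                self.prev_group, self.group = self.group, prefetch
            self.next_block = block + 1
            return prefetch

    ra = ReadAhead()
    for b in range(8):
        print(b, sorted(ra.access(b)))       # watch the group grow and advance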


Belady’s algorithm can be non-optimal given kernel prefetching

• Access sequence: a c e g i k m o a b c d e f g h i j k l m n o p

• Without prefetching: Belady’s Alg. 16 cache misses; LRU 23 cache misses

• With prefetching: Belady’s Alg. 8 cache misses; LRU 6 cache misses
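These counts can be checked with a small simulator. The sketch below counts misses for LRU and Belady's OPT without prefetching; with an assumed cache size of 8 blocks it reproduces the 23 (LRU) and 16 (Belady) figures above. The with-prefetching counts depend on the exact kernel prefetch model and are not reproduced here.

    # Miss counters for LRU and Belady's OPT (no prefetching).
    def lru_misses(trace, size):
        cache, misses = [], 0                # list ordered LRU -> MRU
        for b in trace:
            if b in cache:
                cache.remove(b)              # hit: refresh recency
            else:
                misses += 1
                if len(cache) == size:
                    cache.pop(0)             # evict the LRU block
            cache.append(b)
        return misses

    def opt_misses(trace, size):
        cache, misses = set(), 0
        for i, b in enumerate(trace):
            if b not in cache:
                misses += 1
                if len(cache) == size:
                    # Evict the block whose next use is farthest in the future.
                    rest = trace[i + 1:]
                    far = max(cache, key=lambda x: rest.index(x)
                              if x in rest else float("inf"))
                    cache.remove(far)
            cache.add(b)
        return misses

    trace = list("acegikmoabcdefghijklmnop")          # the sequence above
    print(lru_misses(trace, 8), opt_misses(trace, 8)) # -> 23 16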


Prefetching has been ignored in algorithm design

• Caching algorithms have been proposed and studied without considering prefetching:
  – OPT
  – LRU
  – LRU-K [SIGMOD 1993]
  – 2Q [VLDB 1994]
  – LRFU [TC 2001]
  – MQ [USENIX 2001]
  – LIRS [SIGMETRICS 2002]
  – ARC [FAST 2003]

• Changes to OPT, LRU, 2Q, LIRS will be explained


OPT

• OPT is based on Belady's cache replacement algorithm
  – Off-line: it has knowledge of future references

• In the presence of Linux kernel prefetching:
  – Prefetched blocks are assumed to be accessed most recently and are inserted into the cache according to the original OPT algorithm
  – But OPT is given the added capability to immediately identify wrong prefetches, i.e., prefetched blocks that
    • will never be accessed on-demand, or
    • will be accessed further in the future than all other blocks in the cache
  – Wrongly prefetched blocks become immediate candidates for removal (see the sketch below)
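The wrong-prefetch test can be stated compactly. A sketch, with OPT's offline future knowledge modeled as a plain list of upcoming demand accesses; the names and example values are illustrative.

    # A prefetched block is "wrong" if it is never referenced again, or if its
    # next reference is farther away than that of every block in the cache.
    def next_use(block, future):
        return future.index(block) if block in future else float("inf")

    def is_wrong_prefetch(block, cache, future):
        d = next_use(block, future)
        return d == float("inf") or all(d > next_use(c, future) for c in cache)

    future = list("abcx")                              # upcoming demand accesses
    print(is_wrong_prefetch("z", {"a", "b"}, future))  # True: never accessed
    print(is_wrong_prefetch("x", {"a", "b"}, future))  # True: farthest of all
    print(is_wrong_prefetch("a", {"b", "x"}, future))  # False: needed soonest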


LRU

• LRU is the most widely used replacement policy

• In the presence of kernel prefetching, adapted LRU:
  – On each access, the kernel determines the number of blocks that need to be prefetched
  – Prefetched blocks are inserted at the MRU locations just like regular blocks (see the sketch below)
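A minimal sketch of this adapted LRU, built on Python's OrderedDict; the cache size and block names are illustrative.

    # LRU in which prefetched blocks enter at the MRU position like demand blocks.
    from collections import OrderedDict

    class PrefetchLRU:
        def __init__(self, size):
            self.size = size
            self.cache = OrderedDict()       # oldest entry = LRU victim

        def _insert(self, block):
            if block in self.cache:
                self.cache.move_to_end(block)       # refresh to MRU
            else:
                if len(self.cache) == self.size:
                    self.cache.popitem(last=False)  # evict the LRU block
                self.cache[block] = True

        def access(self, block, prefetch=()):
            hit = block in self.cache
            self._insert(block)              # demand block goes to MRU
            for p in prefetch:               # prefetched blocks go to MRU too
                self._insert(p)
            return hit

    c = PrefetchLRU(4)
    c.access(1, prefetch=[2, 3])             # miss; blocks 2 and 3 prefetched
    print(c.access(2))                       # -> True (hit on prefetched block)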


2Q

• Three buffers and the algorithm:
  – A1in queue: all missed blocks are initially placed here
  – A1out queue: when blocks are replaced from the A1in queue in FIFO order, their addresses are temporarily placed here
  – Am queue: when a block is re-referenced while its address is in the A1out queue, it is promoted to the Am queue

[Figure: after the accesses 10, 11, 12, 13, 14, 11, 12, 22, A1in holds blocks 12, 13, 14, 22; A1out holds the addresses of blocks 10 and 11; Am is empty.]
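A sketch of this simplified 2Q scheme; the queue capacities are illustrative assumptions. With an A1in capacity of 4, replaying the trace from the figure reproduces the queue contents shown above.

    # Simplified 2Q: misses enter the FIFO A1in queue; addresses evicted from
    # A1in are remembered in A1out; a reference that hits an A1out address
    # promotes the block into the LRU-managed Am queue.
    from collections import OrderedDict, deque

    class TwoQ:
        def __init__(self, a1in_size=4, a1out_size=8, am_size=8):
            self.a1in = deque()              # resident blocks, FIFO
            self.a1out = deque()             # evicted addresses only, FIFO
            self.am = OrderedDict()          # resident blocks, LRU order
            self.a1in_size = a1in_size
            self.a1out_size = a1out_size
            self.am_size = am_size

        def _evict_a1in(self):
            victim = self.a1in.popleft()     # FIFO eviction from A1in
            self.a1out.append(victim)        # remember the address only
            if len(self.a1out) > self.a1out_size:
                self.a1out.popleft()

        def access(self, block):
            if block in self.am:             # Am hit: refresh LRU position
                self.am.move_to_end(block)
            elif block in self.a1in:         # A1in hit: block stays put
                pass
            elif block in self.a1out:        # address hit: promote to Am
                self.a1out.remove(block)
                if len(self.am) == self.am_size:
                    self.am.popitem(last=False)
                self.am[block] = True
            else:                            # miss: new blocks enter A1in
                if len(self.a1in) == self.a1in_size:
                    self._evict_a1in()
                self.a1in.append(block)

    q = TwoQ()
    for b in (10, 11, 12, 13, 14, 11, 12, 22):
        q.access(b)
    print(list(q.a1in), list(q.a1out), list(q.am))
    # -> [12, 13, 14, 22] [10, 11] []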


2Q – With Adaptation (in the presence of kernel prefetching)

• Prefetched blocks are treated as on-demand blocks:
  – A prefetched block is initially placed into the A1in queue
  – On a subsequent on-demand access, the block stays in the A1in queue
  – If a prefetched block is evicted from the A1in queue before any on-demand access, it is simply discarded, as opposed to being moved into the A1out queue
  – If a block currently in the A1out queue is prefetched, it is promoted into the Am queue as if it were accessed on-demand (see the sketch below)

[Figure: after the demand and prefetch accesses 10, 11, 12, 11, 13, 14, 11, 22, 23, A1in holds blocks 12, 13, 14, 22, 23; A1out holds the addresses of blocks 10 and 11; Am is empty.]
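A sketch of this adaptation, extending the TwoQ class from the previous sketch; only the A1in eviction rule and the bookkeeping of demand-accessed blocks change. Promotion on a prefetch that hits an A1out address falls out of the base class, which promotes on any access.

    # 2Q adapted for prefetching: blocks evicted from A1in that were never
    # demand-accessed are discarded instead of being remembered in A1out.
    class PrefetchTwoQ(TwoQ):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self.demanded = set()            # A1in blocks accessed on-demand

        def _evict_a1in(self):
            victim = self.a1in.popleft()
            if victim in self.demanded:      # demand-accessed: keep address
                self.demanded.discard(victim)
                self.a1out.append(victim)
                if len(self.a1out) > self.a1out_size:
                    self.a1out.popleft()
            # else: wrong prefetch, silently discarded (no A1out entry)

        def access(self, block, prefetched=False):
            if not prefetched:
                self.demanded.add(block)     # mark on-demand accesses
            super().access(block)

    q = PrefetchTwoQ()
    q.access(10)                             # demand miss: enters A1in
    q.access(11, prefetched=True)            # prefetched: enters A1in as well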


LIRS

• LIRS dynamically and responsively maintains the LIR block set and the HIR block set, and keeps the LIR block set in the cache

• In the presence of kernel prefetching, adapted LIRS:
  – Prefetched blocks are not inserted into the LIRS stack S; they are only inserted into the HIR queue Q
  – If a prefetched block had no existing entry in LIRS stack S, the first on-demand access to the block causes it to be inserted at the top of LIRS stack S as an HIR block
  – If a prefetched block already has an entry in LIRS stack S, the first on-demand access to the block is treated as a LIR block access (see the sketch below)
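A sketch of only these three insertion rules, over deliberately simplified structures; the full LIR/HIR promotion, demotion, and stack-pruning machinery of LIRS is omitted.

    # Where prefetched blocks go in adapted LIRS: the stack S (end = top) maps
    # blocks to their status, and Q holds resident HIR blocks.
    from collections import OrderedDict

    class AdaptedLIRSSketch:
        def __init__(self):
            self.S = OrderedDict()           # block -> "LIR" or "HIR"
            self.Q = []                      # resident HIR blocks

        def on_prefetch(self, block):
            # Rule 1: prefetched blocks enter Q only, never stack S.
            if block not in self.Q:
                self.Q.append(block)

        def on_demand_access(self, block):
            if block in self.S:
                # Rule 3: an existing entry in S makes the first demand
                # access count as a LIR access.
                self.S[block] = "LIR"
                self.S.move_to_end(block)
            else:
                # Rule 2: no entry in S: insert at the top of S as HIR.
                self.S[block] = "HIR"

    lirs = AdaptedLIRSSketch()
    lirs.on_prefetch(7)                      # prefetched: resident in Q only
    lirs.on_demand_access(7)                 # first demand access: HIR in S
    print(dict(lirs.S), lirs.Q)              # -> {7: 'HIR'} [7]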


Performance Evaluation

• Trace collection
  – Interception of I/O system calls (using a modified Linux strace utility)
  – Collects I/O access type, time, file identifier (inode), and I/O size

• Timing-accurate trace simulator
  – Detailed implementation of kernel prefetching and clustering
  – Interfaces with the DiskSim simulator to simulate I/O time
  – Implements: OPT, LRU, LRU-2, LRFU, LIRS, MQ, 2Q, ARC

• Metrics
  – Hit ratio
  – Aggregated synchronous and asynchronous disk I/O requests
  – Actual running time


Applications and Trace Statistics

(Concurrent applications: Multi1: cscope, gcc; Multi2: cscope, gcc, viewperf; Multi3: glimpse, TPC-H.)


Hit ratio results for cscope

• Kernel prefetching has a significant impact on the hit ratio

• The improvement differs across algorithms

• Prefetching can result in significant changes in the relative performance of replacement algorithms


Disk requests results for cscope

• The clustering of I/O requests in the presence of prefetching results in a significant reduction in the number of disk requests

• The effect is complex and closely tied to the file access patterns


Execution time results for cscope

• Reduction in the # of disk requests due to kernel prefetching does not necessarily translate into reduction in execution time.


Results for the other three sequential access applications

• Glimpse
  – It also benefits from prefetching
  – The changes in the relative behavior of different algorithms observed in cscope with prefetching are also observed in glimpse

• Viewperf
  – It benefits the most from prefetching
  – The behavior of different cache replacement algorithms is similar to that observed in cscope

• Gcc
  – Many accesses are to small files, so there is little opportunity for prefetching
  – All three performance metrics are almost identical with and without prefetching


Hit ratio results for tpc-h

• Prefetching provides little improvement in the hit ratio for random access patterns


Disk requests results for tpc-h

• Most prefetched blocks are never accessed; as a result, the number of disk requests is doubled


Execution time results for tpc-h

• The significant increase in the number of I/Os translates into a significant increase in the execution time


Results for concurrent applications

• Multi1: cscope, gcc
  – Similar to that of cscope

• Multi2: cscope, gcc, viewperf
  – Similar to that of Multi1; however, prefetching does not improve the execution time because viewperf is CPU-bound

• Multi3: glimpse, TPC-H
  – Similar to that of tpc-h


Number and size of synchronous and asynchronous disk I/Os in cscope at 128MB cache size

• The total number of disk requests with prefetching is at least 30% lower than without prefetching for all schemes except OPT

• Most of the reduction in disk requests comes from issuing asynchronous disk requests, which can be overlapped with CPU time


Conclusions

• In this research work, the authors
  – Proposed prefetching implementations for different replacement algorithms
  – Built a timing simulator to evaluate their relative performance

• The paper shows
  – Prefetching impacts hit ratio, disk requests, and execution time
  – Comparison of hit ratios alone is insufficient
  – Kernel prefetching can narrow the performance gap between different replacement algorithms
  – Kernel prefetching can also change the relative performance benefits of different replacement algorithms

• Future buffer caching research should
  – Take prefetching and I/O clustering into consideration
  – Simulate execution time


Discussions (1)

• Good points
  – No new algorithm, but the paper is the first to simulate and compare the impact of kernel prefetching on well-known buffer cache replacement algorithms
  – The results are not very astonishing, and we can guess the general results for sequential and random workloads; but this paper is the first to report them

• Bad points
  – The simulation is based only on I/O traces; it would be better if VM-trace-based results were also presented
  – The simulation results for concurrent applications are not analyzed in detail (in the paper itself)
  – It would be better if the unification of buffer cache and page cache in many OSes were considered, and if the competition between process page accesses and file cache page accesses were simulated and analyzed


Discussions (2)

• Some questions:
  – Regarding Belady's anomaly:
    • In the LIRS paper: Belady's anomaly appears in 2Q and ARC for the glimpse workload
    • In this paper: without prefetching, the simulation results did not show Belady's anomaly; with prefetching, Belady's anomaly appears in ARC for the glimpse workload
    • Why the difference? LRU has no Belady's anomaly. What about the other algorithms?
  – Regarding simulations:
    • Is there any relationship between the cache sizes selected in simulation and the real environment where the trace was collected?
    • Is performance under thrashing conditions still worth simulating?


References

• "A Study of Integrated Prefetching and Caching Strategies", P. Cao, et al., ACM SIGMETRICS, 1995

• "Making LRU Friendly to Weak Locality Workloads: A Novel Replacement Algorithm to Improve Buffer Cache Performance", S. Jiang and X. Zhang, IEEE Transactions on Computers, Vol. 54, No. 9, September 2005

• "CLOCK-Pro: An Effective Improvement of the CLOCK Replacement", S. Jiang, F. Chen, and X. Zhang, Proceedings of the 2005 USENIX Annual Technical Conference (USENIX '05)

• "Page Replacement in Linux 2.4 Memory Management", Rik van Riel, Proceedings of the 2001 USENIX Annual Technical Conference, FREENIX Track

• "Towards an O(1) VM: Making Linux Virtual Memory Management Scale Towards Large Amounts of Physical Memory", Rik van Riel, Proceedings of the Linux Symposium, July 2003

• "Journal File Systems in Linux", June 21, 2005 (http://bulma.net/impresion.phtml?nIdNoticia=1154)

• "The Buffer Cache", June 21, 2005 (http://www.faqs.org/docs/linux_admin/buffer-cache.html)

• "The Performance Impact of Kernel Prefetching on Buffer Cache Replacement", Chris Gniady, et al. (Purdue University), ACM SIGMETRICS 2005 presentation slides

• "More on File System", lecture notes, June 22, 2005 (http://www.cs.rochester.edu/~kshen/csc256-spring2005/lectures/lecture16-file2.pdf)


Thank you!