Effectively Prefetching Remote Memory with Leap

Hasan Al Maruf and Mosharaf Chowdhury


Memory-Intensive Applications


Perform Great!

[Figure: TPC-C on VoltDB. Throughput (thousands of TPS) by in-memory working set (100%, 75%, 50%)]

Perform Great Until Memory Runs Out

[Figure: TPC-C on VoltDB. Throughput (thousands of TPS) by in-memory working set: 38.61 at 100%, 6.61 at 75%, 1.01 at 50%]

[Figure: PageRank on PowerGraph. Completion time (s) by in-memory working set: 116.19 at 100%, 124.96 at 75%, 424.47 at 50%]

50% Less Memory Causes Slowdown of …

[Figures: TPC-C on VoltDB and PageRank on PowerGraph at 100%, 75%, and 50% in-memory working set, as on the previous slide]
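The slowdown factors this slide refers to can be recomputed from the bar labels in the two figures. This is a minimal sketch, assuming the three labels in each figure map to the 100%, 75%, and 50% working sets in the order they appear; the variable names are illustrative only.

```python
# Bar labels from the two figures, keyed by in-memory working set.
# Assumption: labels map to 100%, 75%, 50% in the order they appear.
tpcc_ktps = {"100%": 38.61, "75%": 6.61, "50%": 1.01}           # higher is better
pagerank_secs = {"100%": 116.19, "75%": 124.96, "50%": 424.47}  # lower is better

# Throughput slowdown: how many times fewer transactions per second.
tpcc_slowdown = tpcc_ktps["100%"] / tpcc_ktps["50%"]
# Completion-time slowdown: how many times longer the job takes.
pagerank_slowdown = pagerank_secs["50%"] / pagerank_secs["100%"]

print(f"TPC-C on VoltDB: {tpcc_slowdown:.1f}x slower at 50%")
print(f"PageRank on PowerGraph: {pagerank_slowdown:.2f}x slower at 50%")
```

Under that mapping, halving the in-memory working set costs TPC-C roughly 38x in throughput and PageRank roughly 3.7x in completion time.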

Between a Rock and a Hard Place

Overallocation: leads to underutilization

30-40% in Google, Alibaba, and Facebook

vs.

Underallocation: leads to severe performance loss

Memory Disaggregation

[Figure: Machine 1, Machine 2, Machine 3, …, Machine N, each with used memory and free memory; the free memory is exposed as disaggregated (remote) memory]

Remote Memory Access

[Figure: user-space applications on top of memory disaggregation frameworks on top of remote memory]

Memory disaggregation frameworks:

Infiniswap (NSDI’17): remote memory paging

Remote Regions (ATC’18): remote file abstraction

LegoOS (OSDI’18): disaggregated OS

4KB page access latency, local vs. remote: 100 ns vs. 4 µs

Latency requirement for preferable performance [1]: 3 µs

Existing frameworks can’t achieve this!

[1] P. X. Gao et al. “Network requirements for resource disaggregation” OSDI’16.

Obstacles: data path overhead and variation in network latency

Life of a Page

[Figure: data path of a page request, with the average time spent in each stage]

User space: Process 1, Process 2, …, Process N

Kernel space: page fault → Memory Management Unit (MMU) → page cache

Cache hit: 0.27 µs

Cache miss: 2.1 µs

Device mapping layer

Generic block layer (bio): 10.04 µs

I/O scheduler: request queue processing (insertion, merging, sorting, staging, and dispatch) and dispatch queue: 21.88 µs

Block device driver

Remote memory (RDMA): 4.3 µs

Where Does the Time Go?

Fast path: Page Request → In Page Cache? → Yes → Read Request? → Yes → Update Page Table & End I/O

Fast path stage times: 0.12 µs and 0.15 µs (0.27 µs in total)

Slow path: Page Request → In Page Cache? → No → Allocate Cache for Page (2.1 µs) → Prepare for I/O (10.04 µs) → Queue and Batch Requests (21.88 µs) → Execute I/O (RDMA: 4.3 µs)

Design Goal

1. Increase cache hits
• the faster path serves more page faults

2. Reduce the latency of the slow path
• remove unnecessary block-layer operations for RDMA
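To see why both goals matter, the stage latencies from the preceding slides can be added up. This is a rough sketch under two assumptions that the slides do not spell out: the four slow-path numbers map to the four slow-path boxes in the order shown, and the "block-layer operations" are the prepare-for-I/O and queue-and-batch stages.

```python
# Average stage latencies in microseconds, as reported on the slides.
# Assumption: the slow-path numbers map to the slow-path boxes in order.
FAST_PATH_US = [0.12, 0.15]
SLOW_PATH_US = {
    "allocate cache for page": 2.1,
    "prepare for I/O": 10.04,
    "queue and batch requests": 21.88,
    "execute I/O (RDMA)": 4.3,
}
# Assumption: these two stages are the block-layer work that goal 2 targets.
BLOCK_LAYER_STAGES = ("prepare for I/O", "queue and batch requests")

fast_total = sum(FAST_PATH_US)
slow_total = sum(SLOW_PATH_US.values())
block_layer = sum(SLOW_PATH_US[stage] for stage in BLOCK_LAYER_STAGES)
block_share = block_layer / slow_total
slow_without_block_layer = slow_total - block_layer

print(f"fast path: {fast_total:.2f} us")
print(f"slow path: {slow_total:.2f} us")
print(f"block layer share of slow path: {block_share:.1%}")
print(f"slow path without block layer: {slow_without_block_layer:.2f} us")
```

With these numbers the block-layer stages account for most of the slow path, so removing them helps every miss, while each extra cache hit avoids the slow path entirely.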


Leap

Online remote memory prefetcher

Identifies memory access patterns to prefetch pages in a
• fast,
• cache-efficient, and
• resilient manner

without modifying any
• applications, or
• hardware


Life of a Page w/ Leap

[Figure: data path of a page request with Leap, with the average time spent in each stage; the block-layer stages no longer appear between the page cache and remote memory]

User space: Process 1, Process 2, …, Process N

Kernel space: page fault → Memory Management Unit (MMU) → page cache

Cache hit: 0.27 µs

Cache miss: 2.1 µs

Leap: 0.34 µs
• Process-specific page access tracker
• Trend detection
• Prefetch candidate generation
• Prefetcher
• Eager cache eviction

Remote memory (RDMA): 4.3 µs

Prefetching in Linux

Reads ahead pages sequentially

Based only on the last page access

Does not distinguish between processes

Cannot detect thread-level access irregularities

Too aggressive on sequential access: cache pollution

Too conservative on non-sequential access: brings nothing

Prefetching Techniques

Approach              Low Computational  Low Memory  Unmodified   HW/SW         Temporal  Spatial   Low Cache
                      Complexity         Overhead    Application  Independence  Locality  Locality  Pollution
Next N-Line           Yes                Yes         Yes          Yes           No        Yes       No
Stride                Yes                Yes         Yes          Yes           No        Yes       No
Instruction Prefetch  No                 No          No           No            Yes       Yes       No
Linux Read-Ahead      Yes                Yes         Yes          Yes           Yes       Yes       No
Leap                  Yes                Yes         Yes          Yes           Yes       Yes       Yes


Leap Prefetcher

Linear-time and constant memory space

Two main components:
• Trend detection
• Prefetch window size detection

Flow for each page request:

1. Get the prefetch window size

2. Window size = 0? If yes, read only the requested page

3. Otherwise, trend found? If yes, prefetch with the current trend; if no, prefetch with the previous trend
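The decision flow above can be sketched in a few lines. This is an illustrative sketch, not Leap's kernel code: the function name `pages_to_read`, the use of `None` for "no trend", and the choice to lay out prefetch candidates at `page + i * trend` are all assumptions made for the example.

```python
from typing import List, Optional


def pages_to_read(page: int, window_size: int,
                  current_trend: Optional[int],
                  previous_trend: Optional[int]) -> List[int]:
    """Return the pages to read for one request, requested page first."""
    if window_size == 0:
        # Window size = 0: read only the requested page.
        return [page]
    # Trend found: use it. Otherwise fall back to the previous trend.
    trend = current_trend if current_trend is not None else previous_trend
    if trend is None:
        # Assumption: with no trend ever observed there is nothing to follow.
        return [page]
    # Assumption: candidates are spaced by the trend (delta) from the page.
    return [page] + [page + i * trend for i in range(1, window_size + 1)]


print([hex(p) for p in pages_to_read(0x10, 3, 2, -3)])
print([hex(p) for p in pages_to_read(0x3C, 2, None, -3)])
```

The first call follows the current trend of +2; the second has no current trend and falls back to the previous trend of -3.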

Trend Detection

Identifies the majority element in access history

Regular trends can be found within recent accesses

Flexible to short-term irregularity

Flow:

1. Start with a smaller window of the access history

2. Run Boyer-Moore on the window

3. Majority found? If yes, return the majority ∆maj

4. If no: at the maximum window size, no trend is found; otherwise, double the window size and go back to step 2

Trend Detection Example

Access history: page address at each time step and its delta from the previous access

Time  Address  Delta
t0    0x48     +72
t1    0x45     -3
t2    0x42     -3
t3    0x3F     -3
t4    0x3C     -3
t5    0x02     -58
t6    0x04     +2
t7    0x06     +2
t8    0x08     +2
t9    0x0A     +2
t10   0x0C     +2
t11   0x10     +4
t12   0x39     +41
t13   0x12     -39
t14   0x14     +2
t15   0x16     +2

(a) at time t3 (history holds t0 to t3): trend of -3

(b) at time t7 (history holds t0 to t7): trend of -3 disappears, no major new trend

(c) at time t8 (t8 overwrites the slot of t0): trend of +2 detected

(d) at time t15 (history holds t8 to t15): trend of +2 detected among irregularities
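The trend detection flow can be replayed on the example's addresses. This is a minimal sketch, assuming an 8-entry history, a starting window of 4, and a strict majority (more than half of the window); those sizes are read off the example panels, and the function names are illustrative rather than Leap's own.

```python
from collections import deque
from typing import List, Optional


def boyer_moore_majority(window: List[int]) -> Optional[int]:
    """Return the delta seen in more than half of window, else None."""
    candidate, count = None, 0
    for delta in window:                      # pass 1: pick a candidate
        if count == 0:
            candidate, count = delta, 1
        elif delta == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: confirm that the candidate really is a strict majority.
    if candidate is not None and window.count(candidate) > len(window) // 2:
        return candidate
    return None


def find_trend(history: List[int], start_window: int) -> Optional[int]:
    """history holds deltas, oldest first; search the most recent entries."""
    if start_window < 1:
        raise ValueError("start_window must be at least 1")
    if not history:
        return None
    window = start_window
    while True:
        recent = history[-min(window, len(history)):]
        majority = boyer_moore_majority(recent)
        if majority is not None:
            return majority
        if window >= len(history):            # maximum window reached
            return None
        window *= 2                           # double the window and retry


addresses = [0x48, 0x45, 0x42, 0x3F, 0x3C, 0x02, 0x04, 0x06,
             0x08, 0x0A, 0x0C, 0x10, 0x39, 0x12, 0x14, 0x16]
history = deque(maxlen=8)                     # assumed 8-entry access history
trends = []
previous = 0                                  # the example's first delta is +72
for address in addresses:
    history.append(address - previous)
    previous = address
    trends.append(find_trend(list(history), start_window=4))

for t in (3, 7, 8, 15):
    print(f"t{t}: trend = {trends[t]}")
```

Starting from a small recent window lets a new trend surface quickly, as at t8, while doubling the window lets a trend survive a few irregular accesses, as at t15.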

Prefetch Window Size Detection

Cache hit indicates prefetch utilization

High cache hit: increase prefetch window aggressively

No cache hit:
• trend availability: increase prefetch window gradually
• no trend: decrease prefetch window gradually

Gradual slowdown helps during sudden changes
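One way to turn these rules into code is sketched below. The slide gives only the direction of each adjustment, so every concrete rule here is an assumption chosen for illustration: rounding up to a power of two on cache hits, restarting at one page when only a trend is available, shrinking by at most half per step, and the cap of 8 pages.

```python
def next_window_size(prev_size: int, cache_hits: int,
                     follows_trend: bool, max_size: int = 8) -> int:
    """Pick the next prefetch window size (in pages). Illustrative policy."""
    if cache_hits > 0:
        # Prefetched pages were used: grow aggressively by rounding
        # cache_hits + 1 up to the next power of two (assumed rule).
        size = 1
        while size < cache_hits + 1:
            size *= 2
    elif follows_trend:
        # No hits but a trend exists: start again with a single page.
        size = 1
    else:
        # No hits and no trend: aim for no prefetching at all.
        size = 0
    size = min(size, max_size)
    # Gradual slowdown: never drop below half of the previous window at once.
    return max(size, prev_size // 2)


# A window of 8 decays step by step once hits and trend both disappear.
size, steps = 8, []
while size > 0:
    size = next_window_size(size, cache_hits=0, follows_trend=False)
    steps.append(size)
print(steps)
```

Shrinking in steps keeps some prefetching alive through a short burst of irregular accesses instead of dropping straight to zero.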

Evaluation

Deploy and evaluate over 56 Gbps InfiniBand network

Memory disaggregation frameworks:
• Disaggregated VMM: Infiniswap
• Disaggregated VFS: Remote Regions

Lowers Remote Page Access Latency by…

[Figure: CDF of remote page access latency (µs, axis ticks from 0.01 to 10000) for Infiniswap and Infiniswap+Leap, in two panels: Sequential Access and Stride Access]

Efficient Pattern Detection

Detects 29.70% more sequential accesses

Detects most of the irregularity


During irregularities, doing nothing helps the most


Perform Great Even After Memory Runs Out

TPC-C on VoltDB: throughput (thousands of TPS) by in-memory working set

Backend            100%    75%     50%     25%
Disk               38.61   6.61    1.01    -
Infiniswap         37.00   27.74   19.33   1.50
Infiniswap + Leap  37      36.3    35.6    15.6

With a 25% in-memory working set, TPC-C on VoltDB fails on disk.
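To compare the three backends on one scale, each throughput can be expressed as a fraction of the same backend's throughput at a 100% working set. This is a small sketch using the bar labels from the figures; the mapping of labels to working-set sizes follows their order in the plots and is an assumption, as are the variable names.

```python
# Throughput in thousands of TPS, keyed by in-memory working set.
# Assumption: bar labels map to working-set sizes in plot order.
throughput_ktps = {
    "Disk": {"100%": 38.61, "75%": 6.61, "50%": 1.01},
    "Infiniswap": {"100%": 37.00, "75%": 27.74, "50%": 19.33, "25%": 1.50},
    "Infiniswap + Leap": {"100%": 37.0, "75%": 36.3, "50%": 35.6, "25%": 15.6},
}

# Fraction of full-memory throughput retained at each working-set size.
retained = {
    backend: {ws: tps / series["100%"] for ws, tps in series.items()}
    for backend, series in throughput_ktps.items()
}

for backend, series in retained.items():
    cells = ", ".join(f"{ws}: {frac:.1%}" for ws, frac in series.items())
    print(f"{backend}: {cells}")
```

At a 50% working set this gives roughly 2.6% retained for disk, 52.2% for Infiniswap, and 96.2% for Infiniswap + Leap.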

Benefit Breakdown of Leap’s Components

Data path optimizations: single-µs latency up to the 95th percentile

Prefetcher: sub-µs latency up to the 85th percentile

Eager cache eviction: improves the 99th percentile latency by 22%

Future Work

1. Thread-specific prefetching for multiple concurrent streams
• memory is managed at the process level
• this requires significant changes in the virtual memory subsystem

2. Optimized remote I/O interface
• load balancing,
• fault-tolerance,
• data locality, and
• application-specific isolation in remote memory

Leap

Lightweight and efficient data path for remote memory

Online prefetcher with a leaner data path and eager cache eviction policy to improve
• cache hit,
• remote I/O latency, and
• application-level performance

without modifying any
• application, or
• hardware

Thank You!

Source code available at https://github.com/SymbioticLab/leap