44
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi- hundred- gigabit Networks Alireza Farshin * , Amir Roozbeh *+ , Gerald Q. Maguire Jr. * , Dejan Kostić * KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS) + Ericsson Research

Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi- hundred- gigabit Networks

Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić* KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS)

+ Ericsson Research

Page 2: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Traditional I/O

2020-07-02 2

I/O Device

* Direct Memory Access (DMA)

1. I/O device DMAs* packets to main memory

2. CPU later fetches them to cache

Page 3: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Traditional I/O

2020-07-02 3

I/O Device

1. I/O device DMAs* packets to main memory

2. CPU later fetches them to cache

Inefficient:• Large number of accesses to main

memory• High access latency (>60ns)• Unnecessary memory bandwidth usage

* Direct Memory Access (DMA)

Page 4: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Direct Cache Access (DCA)

2020-07-02 4

I/O Device

* PCIe Transaction protocol Processing Hint (TPH)

1. I/O device DMAs packets to main memory

2. DCA exploits TPH* to prefetch a portion of packets into cache

3. CPU later fetches them from cachePrefetch

Page 5: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Direct Cache Access (DCA)

2020-07-02 5

I/O Device

* PCIe Transaction protocol Processing Hint (TPH)

• Still inefficient in terms of memory bandwidth usage• Requires OS intervention and support from processor

1. I/O device DMAs packets to main memory

2. DCA exploits TPH* to prefetch a portion of packets into cache

3. CPU later fetches them from cachePrefetch

Page 6: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Intel Data Direct I/O (DDIO)

2020-07-02 6

I/O Device

• DDIO in Xeon processors since Xeon E5

• DMA packets or descriptors directly to/from Last Level Cache (LLC)

Page 7: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Trends

2020-07-02 7

More in-network computing + offloading capabilities

Push costly calculations into the network and perform stateful functions at the processor, which makes applications more I/O intensive.

Page 8: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Pressure from these trends

2020-07-02 8

Multi-hundred-gigabit networks cannot tolerate memory

access and interarrival time of packets continues to shrink

Every 6.72 ns a new (64-B+20-B*) packet arrives at 100 Gbps

* 7B preamble + 1B start-of-frame delimiter +12B inter-frame gap = 20B

More in-network computing + offloading capabilities

Faster link speeds

Page 9: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

DCA matters because

2020-07-02 9

Without DCA we are unable to process I/O at line rate, thus increasing packet loss or latency when utilizing multi-hundred-gigabit networks.

Page 10: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Forwarding Packets at 100 Gbps

2020-07-02 10

Packet Generator

Device under Test Forwarding Packets

Intel Xeon Gold 6140

Mellanox ConnectX-5

100 Gbps

100 Gbps

Each NIC is placed in a PCIe 3.0 16x slot*

0

200

400

600

800

1000

1200

1400

100 Gbps 200 Gbps

99th

Perc

enti

le L

aten

cy(µ

s)

Rate

* A PCIe 3.0 16x slot is capable of providing ~125 Gbps effective full-duplex bandwidth.

Page 11: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

What happens at 200 Gbps?

2020-07-02 11

Packet Generator

Device under Test Forwarding Packets

Intel Xeon Gold 6140

Mellanox ConnectX-5

2x100 Gbps

100 Gbps

Each NIC is placed in a PCIe 3.0 16x slot*

0

200

400

600

800

1000

1200

1400

100 Gbps 200 Gbps

99th

Perc

enti

le L

aten

cy (

µs)

Latency of the first NIC, when forwarding at indicated aggregate rate

100 Gbps

When forwarding at 200 Gbps, 30% higher

latency for the NIC forwarding at 100 Gbps

30%

* A PCIe 3.0 16x slot is capable of providing ~125 Gbps effective full-duplex bandwidth.

Page 12: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

How does DDIO work?

2020-07-02 12

Writing packets/descriptors: C C C C

C C C C

C C C C

LogicalLLC

CPU Socket

DDIO overwrites a cache line if it is already present in any LLC ways (≡ write update or hit)

Sending/ReceivingPackets via DDIO

Write to theSame cache line

AlreadyPresent In LLC

Page 13: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

How does DDIO work?

2020-07-02 13

Writing packets/descriptors: C C C C

C C C C

C C C C

LogicalLLC

CPU Socket

DDIO overwrites a cache line if it is already present in any LLC ways (≡ write update or hit)

Sending/ReceivingPackets via DDIO

Otherwise, DDIO allocates a cache line in a limited portion of LLC (≡ write allocate or miss)

NotPresent In LLC

Allocatea cache

line

Page 14: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

How does DDIO work?

2020-07-02 14

Writing packets/descriptors: C C C C

C C C C

C C C C

LogicalLLC

CPU Socket

DDIO overwrites a cache line if it is already present in any LLC ways (≡ write update or hit)

Sending/ReceivingPackets via DDIO

Otherwise, DDIO allocates a cache line in a limited portion of LLC (≡ write allocate or miss)

Reading packets/descriptors:NIC reads a cache line if it is already present in any LLC ways (≡ read hit)

Otherwise, NIC reads it from main memory (≡ read miss)

Page 15: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

How does DDIO work?

2020-07-02 15

C C C C

C C C C

C C C C

LogicalLLC

CPU Socket

Sending/ReceivingPackets via DDIO

Designed a set of micro-benchmarks to learn about DDIO:

• Which ways are used for allocation?• How does DDIO interact with other

applications?• Does DMA via a remote CPU socket

pollute LLC?

Page 16: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 16

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

I/O Application

* Cache Allocation Technology

1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

Page 17: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 17

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

Use CAT* to limit code/data

* Cache Allocation Technology

1 2 3 4 5 6 7 8 9 10 11

+ water_nsquared from Splash-3 benchmark

Page 18: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 18

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 19: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 19

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 20: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 20

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 21: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 21

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 22: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 22

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application

Contention with code/data causes a rise in the cache misses of the I/O application

1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 23: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 23

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 24: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 24

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 25: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 25

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 26: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 26

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 27: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 27

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application

Contention with I/O causes a risein the cache misses of the I/O application

1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data

+ water_nsquared from Splash-3 benchmark

Page 28: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

LLC ways used by DDIO

2020-07-02 28

LogicalLLC

C0

Sending/ReceivingPackets via DDIO

C1

I/O Application

Cache-sensitiveApplication+

* Cache Allocation Technology

0

2

4

6

8

10

1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11

Sum

of C

ache

Mis

ses

(Mill

ion)

Ways Allocated by CAT to the Cache-sensitive Application

Contention with I/O causes a risein the cache misses of the I/O application

1 2 3 4 5 6 7 8 9 10 11

Use CAT* to limit code/data See our paper

+ water_nsquared from Splash-3 benchmark

DDIO Ways

Page 29: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

How does DDIO perform?

2020-07-02 29

DDIO cannot provide expected benefits! • ResQ* [NSDI’18]• Intel reports

Write-allocate DDIO could evict not-yet-processed and already-processed packets from LLC

Packet should be read from main memory rather than LLC

Reduce the number of RX descriptors so that the buffer fit in the limited DDIO portion.

* ResQ: Enabling SLOs in Network Function Virtualization

Page 30: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Reducing #Descriptors is Not Sufficient! (1/2)

2020-07-02 30

Increasing the number of RX

descriptors and packet size

adversely affects the performance

of DDIO

DDIO cannot use the whole reserved

capacity in LLC

375 KB ≪ 4.5 MB*

* DDIO uses 2 ways out of 11 ways, i.e., 24.75 MB x 2 / 11 = 4.5 MB

Page 31: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

2020-07-02 31

Reducing #Descriptorsis Not Sufficient! (2/2)

DDIO should be able to perform well

with high number of RX descriptors!

0

20

40

60

80

100

0 2 4 6 8 10 12 14 16 18

DD

IO W

rite

-H

it R

ate

(%)

Number of Cores

Increasing the number of cores does not always improve PCIe metrics for an I/O intensive application.

Forwarding 1500-B Packets at 100 Gbps with 256 per-core RX descriptors

1500-B Packets x 256 x 18 ≈ 6.59 MB >> 4.5 MB

Page 32: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Tuning a little-discussed register can improve the performance of DDIO

2020-07-02 32

LogicalLLC

1 1 0 0 0 0 0 0 0 0 0

IIO LLC WAYS Register

Default value is 0x600

Increasing the number of bits set improves DDIO hit rates.

Page 33: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

DDIO’s effect on hit rates can affect application-level performance based on an application’s characteristics

2020-07-02 33

Impact of Tuning DDIOs)

0300600900

120015001800

512 1024 2048 4096

99th

Perc

enti

le L

ate

ncy

Number of RX Descriptors

2 bits 4 bits 6 bits 8 bits

For example, an I/O intensive application: 2 cores forwarding 1500-B Packets at 100 Gbps

Page 34: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

DDIO’s effect on hit rates can affect application-level performance based on an application’s characteristics

2020-07-02 34

Impact of Tuning DDIOs)

0300600900

120015001800

512 1024 2048 4096

99th

Perc

enti

le L

ate

ncy

Number of RX Descriptors

2 bits 4 bits 6 bits 8 bits

For example, an I/O intensive application: 2 cores forwarding 1500-B Packets at 100 Gbps

Setting more bits reduces tail latency

30%

Page 35: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Is Tuning DDIO Enough?

2020-07-02 35

Tuning is not a perfect solution! Due to:• Cache is used for code/data, • Smaller per-core cache quota, and• Coarse-grained partitions.

Next generation DCA should provide:

Fine-grained placement: Similar to CacheDirector* [EuroSys’19]I/O isolation: Extend CAT+ and CDP++ to include I/OSelective DCA/DMA: only transfer relevant parts of the packet to LLC

* Make the Most out of Last Level Cache in Intel Processors

+ Cache Allocation Technology ++ Code/Data Prioritization

Page 36: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

What about Current Systems?

2020-07-02 36

DMA should not be directed to the cache if this would cause I/O evictions!

• Disabling DDIO for a specific PCIe port• Exploiting a remote socket

Bypassing cache is beneficial in multi-tenant/application environment, where some performance isolation is desired.

Page 37: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

Using Our Knowledge for 200 Gbps

2020-07-02 37

Device under Test Forwarding Packets

100 Gbps

100 Gbps

UPI

0

200

400

600

800

1000

1200

1400

100 Gbps 200 Gbps

99th

Perc

enti

le L

aten

cy (µ

s)Latency of the first NIC versus aggregate Rate

200 Gbps(4 bits)

200 Gbps(RemoteSocket)

200 Gbps(Disable)

Tuning DDIO improves packet processing at 200 Gbps

100 Gbps

Better cache management is necessary formulti-hundred-gigabit-per-second networks

Page 38: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

2020-07-02 38

Other Insights

See our paper for more results about:

• How does receiving rate affect the DDIO performance?• How does processing time affect the DDIO performance?• Is DDIO always beneficial?• Scaling up and DDIO.

We study the performance of DDIO in different scenarios

Page 39: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

2020-07-02 39

Our Key Findings (1/2)

• If an application is I/O bound, adding excessive cores could degrade its performance.

• If an application is I/O bound, tuning a little-discussed register calledIIO LLC WAYS could improve performance and lead to the same improvements as adding more cores.

• If an application starts to become CPU bound, adding more cores could improve its throughput, but it is important to balance load among cores to maximize DDIO’s benefits.

• Getting close to ~100 Gbps can cause DDIO to become a bottleneck. Therefore, it is essential to know when to bypass the cache to realize performance isolation.

Page 40: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

2020-07-02 40

Our Key Findings (2/2)

• If an application is truly CPU/memory bound, tuning DDIO is less efficient.

We now explain the impact of processing timeon the performance DDIO, which resulted in this finding.

Page 41: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

2020-07-02 41

Impact of Processing Time

100

-H

it R

ate

DD

IO M

etri

c

0

20

40

60

80

100

0 10 20 30 40 50 60 70 80 90

(%)

Number of Calls

WriteReadThroughput

Device under Test

InputPacket

OutputPacket

SwappingMAC

Calling Random Number Generator

(std::mt1993)

Increasing processing time improves DDIO performance

Increasing processing time improves the

performance of DDIO

Page 42: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

1000

20

40

60

80

100

0 10 20 30 40 50 60 70 80 90Number of Calls

WriteReadThroughput

2020-07-02 42

Impact of Processing Time

0

20

40

60

80

100

Thro

ughp

ut(G

bps)

Device under Test

InputPacket

OutputPacket

SwappingMAC

Calling Random Number Generator

(std::mt1993)

DDIO performance matters most when an application is I/O bound, rather than CPU/memory bound.

Increasing processing time reduces throughput

-H

it R

ate

DD

IO M

etri

c(%

)

Page 43: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

• DCA/DDIO should be tuned for I/O intensive applications.

• DCA/DDIO needs to be rearchitected for multi-hundred-gigabit networks.

• Benchmark your testbed with our source code.

Conclusion

2020-07-02 43

https://github.com/aliireza/ddio-bench

This work is supported by ERC, SSF, and WASP.

Page 44: Reexamining Direct Cache Access to Optimize I/O Intensive … · * Cache Allocation Technology. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Use CAT* to . limit code/data

2020-07-02 44

Thanks for listeningDo not hesitate to contact us if you have any questions.

[email protected] and [email protected]