Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*
* KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS)
+ Ericsson Research
2020-07-02
Traditional I/O
1. The I/O device DMAs* packets to main memory.
2. The CPU later fetches them into its cache.
Inefficient:
• Large number of accesses to main memory
• High access latency (>60 ns)
• Unnecessary memory bandwidth usage
* Direct Memory Access (DMA)
Direct Cache Access (DCA)
1. The I/O device DMAs packets to main memory.
2. DCA exploits TPH* to prefetch a portion of the packets into the cache.
3. The CPU later fetches them from the cache.
• Still inefficient in terms of memory bandwidth usage
• Requires OS intervention and support from the processor
* TPH: PCIe Transaction Layer Packet Processing Hint
Intel Data Direct I/O (DDIO)
• Available in Intel Xeon processors since the Xeon E5 family
• DMAs packets and descriptors directly to/from the Last Level Cache (LLC)
Trends
More in-network computing + offloading capabilities:
push costly calculations into the network and perform stateful functions at the processor, making applications more I/O intensive.
Pressure from these trends
• More in-network computing + offloading capabilities
• Faster link speeds
Multi-hundred-gigabit networks cannot tolerate main-memory access latency, and the interarrival time of packets continues to shrink:
at 100 Gbps, a new minimum-size (64-B + 20-B*) packet arrives every 6.72 ns.
* 7-B preamble + 1-B start-of-frame delimiter + 12-B inter-frame gap = 20 B
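The 6.72 ns figure follows directly from the frame size and line rate; a quick sketch to reproduce it (plain arithmetic, assuming only the 20-B framing overhead above):

```python
# Wire time of one Ethernet frame: payload plus preamble, start-of-frame
# delimiter, and inter-frame gap (20 B total framing overhead).
def interarrival_ns(frame_bytes: int, rate_gbps: float, overhead_bytes: int = 20) -> float:
    """Time one frame occupies the wire, in nanoseconds (Gbps == bits/ns)."""
    bits = (frame_bytes + overhead_bytes) * 8
    return bits / rate_gbps

print(interarrival_ns(64, 100))  # 6.72 ns between minimum-size packets at 100 Gbps
print(interarrival_ns(64, 200))  # 3.36 ns at 200 Gbps
```

Halving the interarrival budget at 200 Gbps is what makes every main-memory access (>60 ns) so costly.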
DCA matters because
Without DCA we are unable to process I/O at line rate, thus increasing packet loss or latency when utilizing multi-hundred-gigabit networks.
Forwarding Packets at 100 Gbps
Setup: a packet generator sends traffic at 100 Gbps to a Device under Test (Intel Xeon Gold 6140, Mellanox ConnectX-5), which forwards packets back at 100 Gbps. Each NIC is placed in a PCIe 3.0 x16 slot*.
[Figure: 99th percentile latency (µs), 0–1400, vs. rate (100 Gbps, 200 Gbps)]
* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.
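The ~125 Gbps footnote can be reproduced from the PCIe 3.0 link parameters (8 GT/s per lane, 128b/130b line encoding); the sketch below ignores TLP/DLLP protocol overhead, which is why the effective figure is slightly lower:

```python
# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
def pcie3_raw_gbps(lanes: int) -> float:
    """Raw per-direction bandwidth, before TLP/DLLP protocol overhead."""
    return 8.0 * lanes * 128 / 130

print(pcie3_raw_gbps(16))  # ~126.03 Gbps; ~125 Gbps effective after protocol overhead
```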
What happens at 200 Gbps?
Setup: the packet generator now sends 2x100 Gbps to the Device under Test (Intel Xeon Gold 6140, Mellanox ConnectX-5), which forwards at 100 Gbps per NIC. Each NIC is placed in a PCIe 3.0 x16 slot*.
[Figure: 99th percentile latency (µs), 0–1400, of the first NIC when forwarding at the indicated aggregate rate (100 Gbps vs. 200 Gbps)]
When forwarding at 200 Gbps, the NIC forwarding at 100 Gbps sees 30% higher latency.
* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.
How does DDIO work?
[Diagram: CPU socket with per-core slices forming a logical LLC; the NIC sends/receives packets via DDIO]
Writing packets/descriptors:
• DDIO overwrites a cache line if it is already present in any LLC way (≡ write update, or hit).
• Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate, or miss).
Reading packets/descriptors:
• The NIC reads a cache line if it is already present in any LLC way (≡ read hit).
• Otherwise, the NIC reads it from main memory (≡ read miss).
We designed a set of micro-benchmarks to learn about DDIO:
• Which ways are used for allocation?
• How does DDIO interact with other applications?
• Does DMA via a remote CPU socket pollute the LLC?
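The write-side policy above (update on a hit in any way, allocate only into a limited portion on a miss) can be illustrated with a toy, way-granular model of one cache set; the 11 ways and 2 DDIO ways mirror the defaults discussed later, and the round-robin victim choice is a simplification:

```python
# Toy model of one LLC set illustrating DDIO's write policy: a write *hits*
# (updates in place) in any way, but a write *miss* may only allocate into
# the DDIO-reserved ways. Simplified round-robin replacement, not real LRU.
class LLCSet:
    def __init__(self, ways=11, ddio_ways=(9, 10)):
        self.lines = [None] * ways      # one tag per way
        self.ddio_ways = ddio_ways      # ways DDIO may allocate into
        self.rr = 0                     # round-robin victim pointer

    def ddio_write(self, tag):
        if tag in self.lines:           # write update (hit): any way qualifies
            return "hit"
        victim = self.ddio_ways[self.rr % len(self.ddio_ways)]
        self.rr += 1
        self.lines[victim] = tag        # write allocate (miss): DDIO ways only
        return "miss"

s = LLCSet()
print(s.ddio_write("pkt0"))  # miss -> allocated into a DDIO way
print(s.ddio_write("pkt0"))  # hit  -> updated in place
print(s.ddio_write("pkt1"))  # miss
print(s.ddio_write("pkt2"))  # miss -> evicts pkt0 (only 2 DDIO ways)
print(s.ddio_write("pkt0"))  # miss again: DDIO already evicted it
```

The last line shows the failure mode the talk examines: once incoming packets exceed the DDIO portion, DDIO evicts packets it wrote earlier.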
LLC ways used by DDIO
Setup: an I/O application runs on core C0, sending/receiving packets via DDIO, while a cache-sensitive application+ runs on core C1. We use CAT* to limit the LLC ways available to code/data, sliding the cache-sensitive application's two-way allocation across the 11 ways of the LLC.
[Figure: sum of cache misses (million), 0–10, vs. the pair of ways (1,2 ... 10,11) allocated by CAT to the cache-sensitive application; the rightmost ways are marked as the DDIO ways]
• Contention with code/data causes a rise in the cache misses of the I/O application.
• Contention with I/O causes a rise in the cache misses of the I/O application.
See our paper.
* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark suite
How does DDIO perform?
DDIO cannot provide the expected benefits!
• ResQ* [NSDI'18]
• Intel reports
Write-allocate: DDIO could evict not-yet-processed and already-processed packets from the LLC, so packets must be read from main memory rather than the LLC.
Proposed remedy: reduce the number of RX descriptors so that the buffers fit in the limited DDIO portion.
* ResQ: Enabling SLOs in Network Function Virtualization
Reducing #Descriptors is Not Sufficient! (1/2)
Increasing the number of RX descriptors and the packet size adversely affects the performance of DDIO: DDIO cannot use the whole reserved capacity in the LLC (375 KB ≪ 4.5 MB*).
* DDIO uses 2 ways out of 11, i.e., 24.75 MB x 2 / 11 = 4.5 MB
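The footnote's arithmetic, made explicit (24.75 MB is the LLC size of the Xeon Gold 6140 in the testbed above):

```python
# DDIO's default allocation: 2 of the 11 LLC ways on this Xeon Gold 6140.
LLC_MB = 24.75
LLC_WAYS = 11
DDIO_WAYS = 2

ddio_mb = LLC_MB * DDIO_WAYS / LLC_WAYS
print(ddio_mb)  # 4.5 MB reserved for DDIO write allocations
```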
Reducing #Descriptors is Not Sufficient! (2/2)
DDIO should be able to perform well with a high number of RX descriptors!
[Figure: DDIO write-hit rate (%), 0–100, vs. number of cores (0–18), forwarding 1500-B packets at 100 Gbps with 256 per-core RX descriptors]
Increasing the number of cores does not always improve PCIe metrics for an I/O intensive application:
1500-B packets x 256 descriptors x 18 cores ≈ 6.59 MB ≫ 4.5 MB
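The working-set arithmetic behind this slide, as a quick sketch (assuming one 1500-B buffer per RX descriptor):

```python
# Aggregate RX buffer footprint vs. DDIO's ~4.5 MB LLC portion.
def rx_footprint_mb(pkt_bytes: int, descriptors: int, cores: int) -> float:
    """Total receive-buffer footprint in MiB."""
    return pkt_bytes * descriptors * cores / 2**20

print(rx_footprint_mb(1500, 256, 18))  # ~6.59 MB, well beyond the ~4.5 MB DDIO portion
```

Once the footprint exceeds the DDIO portion, write allocations start evicting earlier packets, which is why adding cores stops helping.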
Tuning a little-discussed register can improve the performance of DDIO
The IIO LLC WAYS register holds a bitmask over the 11 LLC ways; its default value is 0x600, i.e., two bits set:
1 1 0 0 0 0 0 0 0 0 0
Increasing the number of bits set improves DDIO hit rates.
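How the 0x600 default maps to ways and capacity can be sketched as follows; only the bitmask arithmetic is shown (the register's address and write mechanics are omitted), and the 8-bit mask value is a hypothetical contiguous extension of the default:

```python
# IIO LLC WAYS is a bitmask over the LLC ways; each set bit lets DDIO
# allocate into one more way. Default 0x600 = 0b110_0000_0000 = 2 ways.
def ddio_ways(mask: int) -> int:
    """Number of LLC ways DDIO may allocate into."""
    return bin(mask).count("1")

def ddio_capacity_mb(mask: int, llc_mb: float = 24.75, llc_ways: int = 11) -> float:
    """LLC capacity available to DDIO write allocations."""
    return llc_mb * ddio_ways(mask) / llc_ways

print(ddio_ways(0x600))         # 2
print(ddio_capacity_mb(0x600))  # 4.5 (MB)
print(ddio_capacity_mb(0x7F8))  # hypothetical 8-bit mask -> 18.0 (MB)
```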
Impact of Tuning DDIO
DDIO's effect on hit rates can affect application-level performance, depending on an application's characteristics. For example, an I/O intensive application: 2 cores forwarding 1500-B packets at 100 Gbps.
[Figure: 99th percentile latency (µs), 0–1800, vs. number of RX descriptors (512, 1024, 2048, 4096), with IIO LLC WAYS set to 2, 4, 6, and 8 bits]
Setting more bits reduces tail latency by 30%.
Is Tuning DDIO Enough?
Tuning is not a perfect solution, because:
• The cache is also used for code/data,
• Per-core cache quotas become smaller, and
• Partitions are coarse-grained.
The next generation of DCA should provide:
• Fine-grained placement: similar to CacheDirector* [EuroSys'19]
• I/O isolation: extend CAT+ and CDP++ to include I/O
• Selective DCA/DMA: only transfer the relevant parts of the packet to the LLC
* Make the Most out of Last Level Cache in Intel Processors
+ Cache Allocation Technology ++ Code/Data Prioritization
What about Current Systems?
DMA should not be directed to the cache if this would cause I/O evictions! Two options available today:
• Disabling DDIO for a specific PCIe port
• Exploiting a remote socket
Bypassing the cache is beneficial in multi-tenant/multi-application environments, where some performance isolation is desired.
Using Our Knowledge for 200 Gbps
Setup: the Device under Test forwards packets from two 100-Gbps NICs; the CPU sockets are connected via UPI.
[Figure: 99th percentile latency (µs), 0–1400, of the first NIC vs. aggregate rate: 100 Gbps, 200 Gbps, 200 Gbps (4 bits), 200 Gbps (remote socket), 200 Gbps (disable)]
Tuning DDIO improves packet processing at 200 Gbps.
Better cache management is necessary for multi-hundred-gigabit-per-second networks.
Other Insights
We study the performance of DDIO in different scenarios. See our paper for more results about:
• How does the receiving rate affect DDIO performance?
• How does processing time affect DDIO performance?
• Is DDIO always beneficial?
• Scaling up and DDIO.
Our Key Findings (1/2)
• If an application is I/O bound, adding excessive cores could degrade its performance.
• If an application is I/O bound, tuning a little-discussed register called IIO LLC WAYS could improve performance and lead to the same improvements as adding more cores.
• If an application starts to become CPU bound, adding more cores could improve its throughput, but it is important to balance load among cores to maximize DDIO’s benefits.
• Getting close to ~100 Gbps can cause DDIO to become a bottleneck. Therefore, it is essential to know when to bypass the cache to realize performance isolation.
Our Key Findings (2/2)
• If an application is truly CPU/memory bound, tuning DDIO is less efficient.
We now explain the impact of processing time on the performance of DDIO, which resulted in this finding.
Impact of Processing Time
Setup: the Device under Test swaps the MAC addresses of each packet and then calls a random number generator (std::mt19937) a configurable number of times, to emulate increasing per-packet processing time.
[Figure: DDIO write-hit rate and read-hit rate (%) and throughput (Gbps), 0–100, vs. number of calls (0–90)]
• Increasing processing time improves the performance of DDIO.
• Increasing processing time reduces throughput.
DDIO performance matters most when an application is I/O bound, rather than CPU/memory bound.
Conclusion
• DCA/DDIO should be tuned for I/O intensive applications.
• DCA/DDIO needs to be rearchitected for multi-hundred-gigabit networks.
• Benchmark your testbed with our source code: https://github.com/aliireza/ddio-bench
This work is supported by ERC, SSF, and WASP.
Thanks for listening!
Do not hesitate to contact us if you have any questions.