Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*
* KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS)
+ Ericsson Research
2020-07-02
Traditional I/O
1. The I/O device DMAs* packets to main memory.
2. The CPU later fetches them into its cache.
Inefficient:
• Large number of accesses to main memory
• High access latency (>60 ns)
• Unnecessary memory bandwidth usage
* Direct Memory Access (DMA)
Direct Cache Access (DCA)
1. The I/O device DMAs packets to main memory.
2. DCA exploits TPH* to prefetch a portion of the packets into the cache.
3. The CPU later fetches them from the cache.
• Still inefficient in terms of memory bandwidth usage
• Requires OS intervention and support from the processor
* TPH: PCIe Transaction Layer Packet Processing Hint
Intel Data Direct I/O (DDIO)
• Available in Intel Xeon processors since the Xeon E5 family
• DMAs packets and descriptors directly to/from the Last Level Cache (LLC)
Trends
More in-network computing + offloading capabilities:
push costly calculations into the network and perform stateful functions at the processor, making applications more I/O intensive.
Pressure from these trends
• More in-network computing + offloading capabilities
• Faster link speeds
Multi-hundred-gigabit networks cannot tolerate main-memory access latency, and the interarrival time of packets continues to shrink:
at 100 Gbps, a new minimum-size (64-B + 20-B*) packet arrives every 6.72 ns.
* 7-B preamble + 1-B start-of-frame delimiter + 12-B inter-frame gap = 20 B
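The 6.72 ns figure follows directly from the frame size and line rate; a quick sketch to reproduce it (plain arithmetic, assuming only the 20-B framing overhead above):

```python
# Wire time of one Ethernet frame: payload plus preamble, start-of-frame
# delimiter, and inter-frame gap (20 B total framing overhead).
def interarrival_ns(frame_bytes: int, rate_gbps: float, overhead_bytes: int = 20) -> float:
    """Time one frame occupies the wire, in nanoseconds (Gbps == bits/ns)."""
    bits = (frame_bytes + overhead_bytes) * 8
    return bits / rate_gbps

print(interarrival_ns(64, 100))  # 6.72 ns between minimum-size packets at 100 Gbps
print(interarrival_ns(64, 200))  # 3.36 ns at 200 Gbps
```

Halving the interarrival budget at 200 Gbps is what makes every main-memory access (>60 ns) so costly.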
DCA matters because
Without DCA we are unable to process I/O at line rate, thus increasing packet loss or latency when utilizing multi-hundred-gigabit networks.
Forwarding Packets at 100 Gbps
Setup: a packet generator sends traffic at 100 Gbps to a Device under Test (Intel Xeon Gold 6140, Mellanox ConnectX-5), which forwards packets back at 100 Gbps. Each NIC is placed in a PCIe 3.0 x16 slot*.
[Figure: 99th percentile latency (µs), 0–1400, vs. rate (100 Gbps, 200 Gbps)]
* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.
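The ~125 Gbps footnote can be reproduced from the PCIe 3.0 link parameters (8 GT/s per lane, 128b/130b line encoding); the sketch below ignores TLP/DLLP protocol overhead, which is why the effective figure is slightly lower:

```python
# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
def pcie3_raw_gbps(lanes: int) -> float:
    """Raw per-direction bandwidth, before TLP/DLLP protocol overhead."""
    return 8.0 * lanes * 128 / 130

print(pcie3_raw_gbps(16))  # ~126.03 Gbps; ~125 Gbps effective after protocol overhead
```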
What happens at 200 Gbps?
Setup: the packet generator now sends 2x100 Gbps to the Device under Test (Intel Xeon Gold 6140, Mellanox ConnectX-5), which forwards at 100 Gbps per NIC. Each NIC is placed in a PCIe 3.0 x16 slot*.
[Figure: 99th percentile latency (µs), 0–1400, of the first NIC when forwarding at the indicated aggregate rate (100 Gbps vs. 200 Gbps)]
When forwarding at 200 Gbps, the NIC forwarding at 100 Gbps sees 30% higher latency.
* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.
How does DDIO work?
[Diagram: CPU socket with per-core slices forming a logical LLC; the NIC sends/receives packets via DDIO]
Writing packets/descriptors:
• DDIO overwrites a cache line if it is already present in any LLC way (≡ write update, or hit).
• Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate, or miss).
Reading packets/descriptors:
• The NIC reads a cache line if it is already present in any LLC way (≡ read hit).
• Otherwise, the NIC reads it from main memory (≡ read miss).
We designed a set of micro-benchmarks to learn about DDIO:
• Which ways are used for allocation?
• How does DDIO interact with other applications?
• Does DMA via a remote CPU socket pollute the LLC?
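The write-side policy above (update on a hit in any way, allocate only into a limited portion on a miss) can be illustrated with a toy, way-granular model of one cache set; the 11 ways and 2 DDIO ways mirror the defaults discussed later, and the round-robin victim choice is a simplification:

```python
# Toy model of one LLC set illustrating DDIO's write policy: a write *hits*
# (updates in place) in any way, but a write *miss* may only allocate into
# the DDIO-reserved ways. Simplified round-robin replacement, not real LRU.
class LLCSet:
    def __init__(self, ways=11, ddio_ways=(9, 10)):
        self.lines = [None] * ways      # one tag per way
        self.ddio_ways = ddio_ways      # ways DDIO may allocate into
        self.rr = 0                     # round-robin victim pointer

    def ddio_write(self, tag):
        if tag in self.lines:           # write update (hit): any way qualifies
            return "hit"
        victim = self.ddio_ways[self.rr % len(self.ddio_ways)]
        self.rr += 1
        self.lines[victim] = tag        # write allocate (miss): DDIO ways only
        return "miss"

s = LLCSet()
print(s.ddio_write("pkt0"))  # miss -> allocated into a DDIO way
print(s.ddio_write("pkt0"))  # hit  -> updated in place
print(s.ddio_write("pkt1"))  # miss
print(s.ddio_write("pkt2"))  # miss -> evicts pkt0 (only 2 DDIO ways)
print(s.ddio_write("pkt0"))  # miss again: DDIO already evicted it
```

The last line shows the failure mode the talk examines: once incoming packets exceed the DDIO portion, DDIO evicts packets it wrote earlier.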
LLC ways used by DDIO
Setup: an I/O application runs on core C0, sending/receiving packets via DDIO, while a cache-sensitive application+ runs on core C1. We use CAT* to limit the LLC ways available to code/data, sliding the cache-sensitive application's two-way allocation across the 11 ways of the LLC.
[Figure: sum of cache misses (million), 0–10, vs. the pair of ways (1,2 ... 10,11) allocated by CAT to the cache-sensitive application; the rightmost ways are marked as the DDIO ways]
• Contention with code/data causes a rise in the cache misses of the I/O application.
• Contention with I/O causes a rise in the cache misses of the I/O application.
See our paper.
* Cache Allocation Technology
+ water_nsquared from the Splash-3 benchmark suite
How does DDIO perform?
DDIO cannot provide the expected benefits!
• ResQ* [NSDI'18]
• Intel reports
Write-allocate: DDIO could evict not-yet-processed and already-processed packets from the LLC, so packets must be read from main memory rather than the LLC.
Proposed remedy: reduce the number of RX descriptors so that the buffers fit in the limited DDIO portion.
* ResQ: Enabling SLOs in Network Function Virtualization
Reducing #Descriptors is Not Sufficient! (1/2)
Increasing the number of RX descriptors and the packet size adversely affects the performance of DDIO: DDIO cannot use the whole reserved capacity in the LLC (375 KB ≪ 4.5 MB*).
* DDIO uses 2 ways out of 11, i.e., 24.75 MB x 2 / 11 = 4.5 MB
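The footnote's arithmetic, made explicit (24.75 MB is the LLC size of the Xeon Gold 6140 in the testbed above):

```python
# DDIO's default allocation: 2 of the 11 LLC ways on this Xeon Gold 6140.
LLC_MB = 24.75
LLC_WAYS = 11
DDIO_WAYS = 2

ddio_mb = LLC_MB * DDIO_WAYS / LLC_WAYS
print(ddio_mb)  # 4.5 MB reserved for DDIO write allocations
```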
Reducing #Descriptors is Not Sufficient! (2/2)
DDIO should be able to perform well with a high number of RX descriptors!
[Figure: DDIO write-hit rate (%), 0–100, vs. number of cores (0–18), forwarding 1500-B packets at 100 Gbps with 256 per-core RX descriptors]
Increasing the number of cores does not always improve PCIe metrics for an I/O intensive application:
1500-B packets x 256 descriptors x 18 cores ≈ 6.59 MB ≫ 4.5 MB
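The working-set arithmetic behind this slide, as a quick sketch (assuming one 1500-B buffer per RX descriptor):

```python
# Aggregate RX buffer footprint vs. DDIO's ~4.5 MB LLC portion.
def rx_footprint_mb(pkt_bytes: int, descriptors: int, cores: int) -> float:
    """Total receive-buffer footprint in MiB."""
    return pkt_bytes * descriptors * cores / 2**20

print(rx_footprint_mb(1500, 256, 18))  # ~6.59 MB, well beyond the ~4.5 MB DDIO portion
```

Once the footprint exceeds the DDIO portion, write allocations start evicting earlier packets, which is why adding cores stops helping.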
Tuning a little-discussed register can improve the performance of DDIO
The IIO LLC WAYS register holds a bitmask over the 11 LLC ways; its default value is 0x600, i.e., two bits set:
1 1 0 0 0 0 0 0 0 0 0
Increasing the number of bits set improves DDIO hit rates.
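How the 0x600 default maps to ways and capacity can be sketched as follows; only the bitmask arithmetic is shown (the register's address and write mechanics are omitted), and the 8-bit mask value is a hypothetical contiguous extension of the default:

```python
# IIO LLC WAYS is a bitmask over the LLC ways; each set bit lets DDIO
# allocate into one more way. Default 0x600 = 0b110_0000_0000 = 2 ways.
def ddio_ways(mask: int) -> int:
    """Number of LLC ways DDIO may allocate into."""
    return bin(mask).count("1")

def ddio_capacity_mb(mask: int, llc_mb: float = 24.75, llc_ways: int = 11) -> float:
    """LLC capacity available to DDIO write allocations."""
    return llc_mb * ddio_ways(mask) / llc_ways

print(ddio_ways(0x600))         # 2
print(ddio_capacity_mb(0x600))  # 4.5 (MB)
print(ddio_capacity_mb(0x7F8))  # hypothetical 8-bit mask -> 18.0 (MB)
```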
Impact of Tuning DDIO
DDIO's effect on hit rates can affect application-level performance, depending on an application's characteristics. For example, an I/O intensive application: 2 cores forwarding 1500-B packets at 100 Gbps.
[Figure: 99th percentile latency (µs), 0–1800, vs. number of RX descriptors (512, 1024, 2048, 4096), with IIO LLC WAYS set to 2, 4, 6, and 8 bits]
Setting more bits reduces tail latency by 30%.
Is Tuning DDIO Enough?
Tuning is not a perfect solution, because:
• The cache is also used for code/data,
• Per-core cache quotas become smaller, and
• Partitions are coarse-grained.
The next generation of DCA should provide:
• Fine-grained placement: similar to CacheDirector* [EuroSys'19]
• I/O isolation: extend CAT+ and CDP++ to include I/O
• Selective DCA/DMA: only transfer the relevant parts of the packet to the LLC
* Make the Most out of Last Level Cache in Intel Processors
+ Cache Allocation Technology ++ Code/Data Prioritization
What about Current Systems?
DMA should not be directed to the cache if this would cause I/O evictions! Two options available today:
• Disabling DDIO for a specific PCIe port
• Exploiting a remote socket
Bypassing the cache is beneficial in multi-tenant/multi-application environments, where some performance isolation is desired.
Using Our Knowledge for 200 Gbps
Setup: the Device under Test forwards packets from two 100-Gbps NICs; the CPU sockets are connected via UPI.
[Figure: 99th percentile latency (µs), 0–1400, of the first NIC vs. aggregate rate: 100 Gbps, 200 Gbps, 200 Gbps (4 bits), 200 Gbps (remote socket), 200 Gbps (disable)]
Tuning DDIO improves packet processing at 200 Gbps.
Better cache management is necessary for multi-hundred-gigabit-per-second networks.
Other Insights
We study the performance of DDIO in different scenarios. See our paper for more results about:
• How does the receiving rate affect DDIO performance?
• How does processing time affect DDIO performance?
• Is DDIO always beneficial?
• Scaling up and DDIO.
Our Key Findings (1/2)
• If an application is I/O bound, adding excessive cores could degrade its performance.
• If an application is I/O bound, tuning a little-discussed register called IIO LLC WAYS could improve performance and lead to the same improvements as adding more cores.
• If an application starts to become CPU bound, adding more cores could improve its throughput, but it is important to balance load among cores to maximize DDIO’s benefits.
• Getting close to ~100 Gbps can cause DDIO to become a bottleneck. Therefore, it is essential to know when to bypass the cache to realize performance isolation.
Our Key Findings (2/2)
• If an application is truly CPU/memory bound, tuning DDIO is less efficient.
We now explain the impact of processing time on the performance of DDIO, which resulted in this finding.
Impact of Processing Time
Setup: the Device under Test swaps the MAC addresses of each packet and then calls a random number generator (std::mt19937) a configurable number of times, to emulate increasing per-packet processing time.
[Figure: DDIO write-hit rate and read-hit rate (%) and throughput (Gbps), 0–100, vs. number of calls (0–90)]
• Increasing processing time improves the performance of DDIO.
• Increasing processing time reduces throughput.
DDIO performance matters most when an application is I/O bound, rather than CPU/memory bound.
Conclusion
• DCA/DDIO should be tuned for I/O intensive applications.
• DCA/DDIO needs to be rearchitected for multi-hundred-gigabit networks.
• Benchmark your testbed with our source code: https://github.com/aliireza/ddio-bench
This work is supported by ERC, SSF, and WASP.
Thanks for listening!
Do not hesitate to contact us if you have any questions.