2016 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.
Performance Implications of Libiscsi RDMA Support
Roy Shterman Software Engineer, Mellanox
Sagi Grimberg Principal architect, Lightbits labs
Shlomo Greenberg, PhD, Department of Electrical and Computer Engineering
Ben-Gurion University, Israel
Agenda
Introduction to Libiscsi
Introduction to iSER
Libiscsi/iSER implementation
The memory challenge in user-space RDMA
Performance results
Future work
What is Libiscsi?
iSCSI initiator implementation in user space.
High-performance, non-blocking async API.
Mature.
Open-source (GPL) license.
Portable, OS independent.
Fully integrated in QEMU.
Written and maintained by Ronnie Sahlberg.
[https://github.com/sahlberg/Libiscsi]
Why Libiscsi?
Originally developed to provide built-in iSCSI client-side support for KVM/QEMU.
Processes can access private Logical Units (LUNs) without needing root permissions.
Since then, it has also grown iSCSI/SCSI compliance test suites.
iSCSI Extensions for RDMA (iSER)
Specified in IETF RFC 7145.
The transport layer (iSER or iSCSI/TCP) is transparent to the user.
iSER benefits
Zero-copy
CPU offload
Fabric reliability
High IOPS, low latency
Inherits iSCSI management
Fabric/hardware consolidation
InfiniBand and/or Ethernet (RoCE/iWARP)
iSER Read command flow
SCSI Reads:
The initiator sends a Protocol Data Unit (PDU) with the encapsulated SCSI read command to the target.
The target writes the data into the initiator's buffers with RDMA_WRITE operations.
The target sends a response to the initiator, which completes the SCSI command.
iSER Write command flow
SCSI Writes:
The initiator sends a Protocol Data Unit (PDU) with the encapsulated SCSI write command to the target (the PDU can also carry inline data to improve latency).
The target reads the data from the initiator's buffers with RDMA_READ operations.
The target sends a response to the initiator, which completes the SCSI command.
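The two flows differ only in the direction of the RDMA data movement. A minimal sketch (illustrative types only, not libiscsi's actual internals) captures the operation sequence each command follows:

```c
#include <assert.h>

/* Illustrative sketch of the iSER command exchange described above.
 * These enum/function names are invented for illustration. */
enum iser_op {
    SEND_CMD_PDU,    /* initiator sends the SCSI command PDU */
    RDMA_WRITE_DATA, /* target pushes read data into initiator buffers */
    RDMA_READ_DATA,  /* target pulls write data from initiator buffers */
    SEND_RSP_PDU     /* target sends the SCSI response PDU */
};

/* Fill out[0..2] with the operation sequence for a read or a write. */
static void iser_command_flow(int is_read, enum iser_op out[3])
{
    out[0] = SEND_CMD_PDU;
    out[1] = is_read ? RDMA_WRITE_DATA : RDMA_READ_DATA;
    out[2] = SEND_RSP_PDU;
}
```

In both cases the initiator never touches the data path after queueing the command: the target drives the RDMA operations, which is where the zero-copy and CPU-offload benefits come from.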
Libiscsi iSER implementation
Transparent integration.
User-space networking (kernel bypass).
High performance.
Separation of data and control planes.
Reduced latency via non-blocking fd polling.
Libiscsi stack modification
Layered the stack.
Centralized transport-specific code.
Added a clean transport abstraction API.
Plugged in iSER.
typedef struct iscsi_transport {
    int (*connect)(struct iscsi_context *iscsi, union socket_address *sa, int ai_family);
    int (*queue_pdu)(struct iscsi_context *iscsi, struct iscsi_pdu *pdu);
    struct iscsi_pdu *(*new_pdu)(struct iscsi_context *iscsi, size_t size);
    int (*disconnect)(struct iscsi_context *iscsi);
    void (*free_pdu)(struct iscsi_context *iscsi, struct iscsi_pdu *pdu);
    int (*service)(struct iscsi_context *iscsi, int revents);
    int (*get_fd)(struct iscsi_context *iscsi);
    int (*which_events)(struct iscsi_context *iscsi);
} iscsi_transport;
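To show how a backend plugs into this vtable, here is a self-contained sketch: the context, PDU, and socket-address types below are placeholders standing in for libiscsi's internal structs, and the stub callbacks are invented; only the vtable shape mirrors the real API.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder types so the sketch compiles on its own; the real
 * definitions live inside libiscsi. */
struct iscsi_transport;
struct iscsi_context { struct iscsi_transport *transport; };
struct iscsi_pdu { size_t size; };
union socket_address { int ai_family; };

typedef struct iscsi_transport {
    int (*connect)(struct iscsi_context *iscsi, union socket_address *sa,
                   int ai_family);
    int (*queue_pdu)(struct iscsi_context *iscsi, struct iscsi_pdu *pdu);
    /* ...remaining callbacks omitted for brevity... */
} iscsi_transport;

/* A trivial stub transport: a real backend (TCP sockets, or iSER over
 * RDMA verbs) would do actual work in these callbacks. */
static int stub_connect(struct iscsi_context *iscsi, union socket_address *sa,
                        int ai_family)
{
    (void)iscsi; (void)sa; (void)ai_family;
    return 0;                 /* pretend the connection succeeded */
}

static int stub_queue_pdu(struct iscsi_context *iscsi, struct iscsi_pdu *pdu)
{
    (void)iscsi;
    return (int)pdu->size;    /* pretend we queued pdu->size bytes */
}

static iscsi_transport stub_transport = {
    .connect   = stub_connect,
    .queue_pdu = stub_queue_pdu,
};
```

Core libiscsi code dispatches only through these function pointers, so it never needs to know whether the bytes travel over TCP or RDMA; that is the separation that made plugging in iSER tractable.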
QEMU iSER support
The QEMU iSCSI block driver needed some modifications to support iSER:
Move the polling logic to the transport layer.
Pass I/O vectors down to the transport stack.
Work in progress; should be available in the next few weeks.
Also usable through libvirt!
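For context, QEMU's libiscsi block driver attaches a LUN via an iscsi:// URL; something like the following (target address and IQN are made-up placeholders) would route guest I/O through Libiscsi, and hence through whichever transport it selects:

```shell
# Hypothetical example; URL scheme is iscsi://<host>[:<port>]/<target-iqn>/<lun>
qemu-system-x86_64 \
    -drive file=iscsi://192.0.2.10:3260/iqn.2016-06.org.example:ram0/0,format=raw,if=virtio
```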
Experiments and results
Performance was measured with Mellanox ConnectX-4 adapters on both initiator and target.
The target side was TGT, a user-space iSCSI target, with RAM storage devices.
The I/O generator was FIO (Flexible I/O Tester), with each guest using a single CPU core and a single FIO process.
Compared against iSCSI/TCP and block-device pass-through of iSER devices.
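An fio job file along these lines would approximate the described setup; the specific values (runtime, block size, queue depth, device path) are assumptions for illustration, not the talk's actual parameters:

```ini
# Illustrative fio job: one process, direct I/O against the
# iSCSI/iSER-backed block device inside the guest.
[global]
ioengine=libaio
direct=1
time_based=1
runtime=60

[randread]
filename=/dev/sdb    ; assumed path of the attached LUN in the guest
rw=randread
bs=4k
iodepth=32
numjobs=1
```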
[Chart: IOPS vs I/O depth; series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]
[Chart: Bandwidth (KB/s) vs block size (K); series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]
[Chart: Latency (us) vs I/O depth; series: iSER Libiscsi, TCP Libiscsi, iSER PT latency]
[Chart: Latency (us) vs block size (1k-128k); series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]
[Chart: Bandwidth (KB/s) across 1-4 VMs; series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]
[Chart: IOPS across 1-4 VMs; series: iSER Libiscsi, TCP Libiscsi, iSER kernel PT]
RDMA Memory registration
To allow remote access, the application needs to map the buffer with remote-access permissions.
The mapping operation is slow and not suitable for the data plane.
Applications therefore usually preregister all buffers intended for networking and RDMA.
Memory registration in Mid-layers
Mid-layers often don't own the buffers but rather receive them from the application. Examples: OpenMPI, SHMEM, and Libiscsi/iSER.
Memory registration on every data transfer is not acceptable.
Possible solutions
1) Pre-register the entire application address space.
2) Modify applications to use mid-layer buffers.
3) "Pin-down" cache: register and cache mappings on the fly.
4) Pageable RDMA (ODP): let the device and the kernel handle I/O page faults.
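The pin-down cache (option 3) can be sketched in a few lines. This is an illustrative toy, not libiscsi's implementation: `register_region()` stands in for a real, slow RDMA registration call such as `ibv_reg_mr()`, the fixed-size table has no eviction policy, and all names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 16

struct region { void *addr; size_t len; int valid; };
static struct region cache[CACHE_SLOTS];
static int registrations;   /* counts slow-path registrations */

/* Stand-in for the slow registration path (pinning pages and mapping
 * the buffer for RDMA, e.g. via ibv_reg_mr()). */
static void register_region(void *addr, size_t len)
{
    (void)addr; (void)len;
    registrations++;
}

/* Return 1 on a cache hit (fast path, no registration), 0 if a new
 * registration was needed. */
static int pin_down_lookup(void *addr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr && cache[i].len == len)
            return 1;                     /* already registered */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].valid) {
            register_region(addr, len);   /* slow path, once per buffer */
            cache[i] = (struct region){ addr, len, 1 };
            return 0;
        }
    }
    /* A real cache would evict an entry here; omitted for brevity. */
    register_region(addr, len);
    return 0;
}
```

As long as the application reuses the same buffers, every transfer after the first takes the fast path, which is why this approach works well for mid-layers that cannot control the application's buffers. The cost is pinned physical memory and cache-invalidation complexity, which is what motivates ODP below.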
RDMA paging - ODP
RDMA devices can support I/O page faults.
An application can register a "huge" virtual memory region (even its entire memory space).
Hardware and the kernel handle page faults and page invalidations.
If locality is good enough, the performance penalty is amortized.
Not bounded by physical memory.
iSER with ODP and memory windows
iSER can leverage ODP for a more efficient data path.
But it cannot map non-I/O-related memory for remote access.
Solution: open a memory window on a pageable memory region (a fast operation that can be used in the data path).
ODP support for memory windows is in the works.
Initial experiments with ODP look promising.
Future Work
Leveraging RDMA paging support to reduce the memory footprint.
Plenty of room for performance optimizations.
Stability improvements.
Libiscsi iSER unit tests.
Acknowledgments
This project was conducted under the supervision and guidance of Dr. Shlomo Greenberg, Ben-Gurion University.
Special thanks to Ronnie Sahlberg, creator and maintainer of the Libiscsi library, for his support.