
100 GFTP: An Ultra-High Speed Data Transfer Service Over Next Generation 100 Gigabit Per Second Network

Dantong Yu
Stony Brook University / Brookhaven National Lab

Outline

• Project Personnel Update
  – Dantong Yu
  – Thomas Robertazzi
  – Post-doctoral associates: Qian Chen (September 27, 2010), Shudong Jin (October 1, 2010)
  – Student members: Yufen Ren, Tan Li, Rajat Sharma
• Project Introduction and Challenges
• Software Architecture
• Project Plan and Intermediate Testbeds
• Technical Discussion: RDMA vs. TCP

End-to-End 100G Networking

[Diagram: end-to-end networking at 100 Gbits/s. At each end site, 100G applications drive FTP 100 over a 100G NIC; the two sites are connected through a 100 Gbits/s backbone. FTP 100 is our project and its role in this data path.]

Problem Definition and Scope

• Conventional data transfer protocols (TCP/IP) and file I/O have performance gaps at 100G.
• Reliable transfer (error checking and recovery) at 100G speed.
• A coordinated data transfer flow that efficiently traverses file systems and the network; data path decomposition:
  – Data read-in: from source disks to user memory (backend data path)
  – Transport: from source host memory to destination host memory (frontend data path)
  – Data write-out: from user memory to destination disks (backend data path)
• We need external collaborators to work on this together.
• Cost-effective end-to-end data transfer from sources to sinks (10x10GE vs. 1x100GE); reduced port counts.

Challenges (Manageable)

• Host system bottlenecks (a worked check of these numbers follows this list):
  – Intel architecture: QuickPath Interconnect (QPI)
    • Theoretical rate: 6.4 GT/s x 16 (effective link width) x 2 (two links, bidirectional) / 8 = 25.6 GBytes/s
  – AMD architecture: HyperTransport
    • For HT 3.1, a 16-bit bus width gives the same 25.6 GBytes/s.
  – PCI Express and PCIe-based network cards
    • Current NICs are PCIe 2.0 (500 MB/s per lane) x 8 = 4 GB/s (one direction).
    • The fastest option, PCIe 2.0 (500 MB/s per lane) x 16 = 8 GB/s (one direction), is required for 40 Gbps.
    • PCIe 3.0 x 16, which doubles the PCIe 2.0 rate, is required for 100 Gbps.
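A quick sanity check of the bandwidth arithmetic above, written as a minimal C sketch (the figures are the ones quoted on this slide, not measurements):

#include <stdio.h>

int main(void) {
    /* QPI: 6.4 GT/s x 16-bit effective link width x 2 links (bidirectional), in bytes */
    double qpi_gbytes = 6.4 * 16 * 2 / 8;                 /* 25.6 GB/s */

    /* PCIe 2.0: roughly 500 MB/s usable per lane per direction */
    double pcie2_x8  = 0.5 * 8;                           /* 4 GB/s: today's NIC slots    */
    double pcie2_x16 = 0.5 * 16;                          /* 8 GB/s: needed for 40 Gbps   */
    double pcie3_x16 = 2.0 * pcie2_x16;                   /* 16 GB/s: needed for 100 Gbps */

    printf("QPI / HT 3.1 : %5.1f GB/s\n", qpi_gbytes);
    printf("PCIe2 x8     : %5.1f GB/s = %3.0f Gbit/s\n", pcie2_x8,  pcie2_x8  * 8);
    printf("PCIe2 x16    : %5.1f GB/s = %3.0f Gbit/s\n", pcie2_x16, pcie2_x16 * 8);
    printf("PCIe3 x16    : %5.1f GB/s = %3.0f Gbit/s\n", pcie3_x16, pcie3_x16 * 8);
    return 0;
}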

Challenges with Some Uncertainties, and Proposed Solutions

• File system bottlenecks: how to do file stage-in/out
  – Kernel/software stacks are slow, the same problem as with TCP.
  – Look into zero copy, so data is moved into user space in one copy.
  – fopen, sendfile, O_DIRECT: each has some problem or restriction (an O_DIRECT sketch follows this list).
  – Look into Lustre RDMA to pull data directly into user space.
  – Can a single file client (single server) pull files at 100 Gbps?
  – Look for collaborators who have this type of expertise.
• Storage:
  – Need to support 100 Gbps in/out from the disk spindles.
  – Multiple RAID controllers (large cache).
    • An LSI 3ware controller supports up to 2.5 GByte/s READ.
  – Multiple RAID controllers are needed to reach 100 Gbps, i.e., multiple files must be streamed into the buffer in parallel.
  – A switch fabric interconnects the disk servers and the FTP 100 servers.
    • Storage aggregation from disks into the FTP server disk partition.
  – Look for collaborators who have this type of expertise.
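A minimal sketch of one stage-in option named above: reading a file with O_DIRECT so the page cache is bypassed and data lands in an aligned user buffer in one copy. The path and block size are illustrative assumptions, not the project's configuration.

#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE (4 * 1024 * 1024)   /* 4 MB, illustrative; keep it a multiple of the sector size */

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "/lustre/testfile";   /* hypothetical path */
    void *buf;

    /* O_DIRECT requires the user buffer (and file offsets) to be aligned. */
    if (posix_memalign(&buf, 4096, BLOCK_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    ssize_t n;
    unsigned long long total = 0;
    while ((n = read(fd, buf, BLOCK_SIZE)) > 0)    /* each read bypasses the page cache */
        total += (unsigned long long)n;
    if (n < 0)
        perror("read");

    printf("read %llu bytes via O_DIRECT\n", total);
    close(fd);
    free(buf);
    return 0;
}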

FTP 100 Design Challenges

• Such high-performance data transfer requires multiple file reads/writes in flight.
• Implement buffer management (stream multiple files into a buffer in system memory or in NIC card memory) and provide a handshake with the backend file systems; a sketch of this buffering scheme follows this list.
• Synchronization between reads and writes is a challenge.
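A minimal sketch of the buffer-management idea, assuming one bounded ring of fixed-size blocks shared by a single file-reader thread and a single network-sender thread. The names, ring depth, and block size are illustrative assumptions, not the FTP 100 design.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS    16                    /* ring depth, illustrative              */
#define BLOCK_SIZE (10 * 1024 * 1024)    /* 10 MB, matching the RDMA block size   */

struct block {
    char   data[BLOCK_SIZE];
    size_t len;                          /* bytes actually read into data         */
};

/* Single-producer (file reader), single-consumer (network sender) ring. */
static struct block    ring[NBLOCKS];
static size_t          head, tail, filled;
static bool            done;             /* reader has reached end of input       */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Reader: wait for a free slot and return it (not yet visible to the sender). */
struct block *reader_claim(void) {
    pthread_mutex_lock(&lock);
    while (filled == NBLOCKS)
        pthread_cond_wait(&not_full, &lock);
    struct block *b = &ring[head];
    pthread_mutex_unlock(&lock);
    return b;
}

/* Reader: publish the slot just filled; pass end_of_input = true after the last block. */
void reader_commit(bool end_of_input) {
    pthread_mutex_lock(&lock);
    head = (head + 1) % NBLOCKS;
    filled++;
    if (end_of_input)
        done = true;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Sender: wait for a filled slot; returns NULL once the reader is done and the ring is drained. */
struct block *sender_take(void) {
    pthread_mutex_lock(&lock);
    while (filled == 0 && !done)
        pthread_cond_wait(&not_empty, &lock);
    struct block *b = (filled > 0) ? &ring[tail] : NULL;
    pthread_mutex_unlock(&lock);
    return b;
}

/* Sender: hand the slot back to the reader after the network send has completed. */
void sender_release(void) {
    pthread_mutex_lock(&lock);
    tail = (tail + 1) % NBLOCKS;
    filled--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
}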

End System Multi-Layer Capability View

[Layered capability diagram spanning the application/middleware, service, control, management, AA (authentication/authorization), and security planes:
  – Application layer: applications (Climate 100, OSG), SRM/BeSTMan
  – Layer 4: FTP 100 over RDMA, TCP/UDP, TeraPaths services
  – Layer 3: IP/MPLS, QoS, TeraPaths services and control
  – Layer 2: VLANs/Ethernet/IB, Layer 2 control via OSCARS
  – Layer 1/2: Layer 1 control, G.709 in the Acadia 100G NIC
Some planes are marked "not in implementation"; others leverage existing systems, including the DOE 100G ANI.]

FTP Development with OpenFabrics

[Software stack diagram. The ftp application sits in user space and handles memory registration, queue management, and verbs calls through the RDMA CM and InfiniBand Verbs libraries (communication path), and talks to Lustre for file operations (cluster file system path). In kernel space, the OpenFabrics kernel modules sit above the iWARP and IB drivers. The hardware layer is an iWARP RNIC on Ethernet or an InfiniBand HCA on an InfiniBand fabric.]

• rdmacm (librdmacm): user-space library for establishing RDMA communication. It includes both InfiniBand-specific and general RDMA connection management for unreliable datagram, reliable connected, and multicast data transfers.

• libibverbs: a library that allows user-space processes to use InfiniBand/RDMA "verbs" directly.

Both are distributed as part of OFED.

An Example of FTP via OpenFabrics

[Sequence diagram for PUT/GET between an RDMA FTP client and an RDMA FTP server.]

Server: rdma_getaddrinfo() → rdma_create_ep() → rdma_listen() → rdma_accept() (blocks until a connection arrives from the client) → rdma_get_recv_comp() / rdma_post_send() for the data exchange → rdma_disconnect() → rdma_dereg_mr() → rdma_destroy_ep()

Client: rdma_getaddrinfo() → rdma_create_ep() → rdma_connect() (connection establishment) → rdma_post_send() / rdma_get_recv_comp() for the data exchange → rdma_disconnect() → rdma_dereg_mr() → rdma_destroy_ep()

The FTP protocol logic drives these calls on top of the file system (FS); a code sketch follows.
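A minimal client-side sketch of the call sequence above, using the librdmacm wrappers shipped with OFED. The server name, port, and block size are illustrative assumptions, and error handling is abbreviated.

#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>

#define BLOCK_SIZE (10 * 1024 * 1024)   /* 10 MB, matching the RDMA block size used in the tests */

static char buf[BLOCK_SIZE];

int main(void)
{
    struct rdma_addrinfo hints, *res;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id;
    struct ibv_mr *mr;
    struct ibv_wc wc;

    /* Resolve the server ("ftp100-server" and port "7471" are hypothetical). */
    memset(&hints, 0, sizeof hints);
    hints.ai_port_space = RDMA_PS_TCP;
    if (rdma_getaddrinfo("ftp100-server", "7471", &hints, &res)) {
        perror("rdma_getaddrinfo");
        return 1;
    }

    /* Create the endpoint (QP and completion queues) in one call. */
    memset(&attr, 0, sizeof attr);
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;
    if (rdma_create_ep(&id, res, NULL, &attr)) {
        perror("rdma_create_ep");
        return 1;
    }
    rdma_freeaddrinfo(res);

    /* Register the data block and connect to the listening server. */
    mr = rdma_reg_msgs(id, buf, BLOCK_SIZE);
    if (!mr || rdma_connect(id, NULL)) {
        perror("rdma_reg_msgs/rdma_connect");
        return 1;
    }

    /* PUT: send one block (the server is assumed to have posted a matching
       receive of at least BLOCK_SIZE) and wait for the send completion. */
    if (rdma_post_send(id, NULL, buf, BLOCK_SIZE, mr, 0) ||
        rdma_get_send_comp(id, &wc) <= 0) {
        perror("rdma_post_send/rdma_get_send_comp");
        return 1;
    }

    /* Tear down, mirroring the diagram above. */
    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    return 0;
}

A real PUT would loop over blocks handed to it by the buffer manager rather than sending a single block.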

One Year Roadmap (08/10 - 07/11)

Milestones, roughly in order across 08/10, 10/10, 12/10, 02/11, 04/11, 06/11, and 07/11:
• Iperf+RDMA for data and file transfer
• FTP version 0.1: OpenBSD FTP + RDMA, single-file transfer; 25 Gbps Lustre testbed
• In-house back-to-back 25+10 Gbps data transfer test
• FTP version 0.2: multiple-file parallel transfer, Lustre file system support; 40 Gbps Lustre testbed
• FTP version 0.3: bug fixes and performance improvement
• FTP version 1.0: support for the Acadia emulated 40 Gbps NIC
• FTP 100 and Acadia integration into the BNL 40 Gbps infrastructure; 40 Gbps performance test with all components

25 Gbps Lustre System Testbed (Planned)

[Testbed diagram: two IBM System x3650 M3 servers with RAM disks/SSDs connected over InfiniBand through a Mellanox switch to a Lustre file system with six OSS nodes.]

• Front-end connection: 40GE (32 Gbps effective; PCIe 2.0 x8 = 500 MByte/s x 8 = 4 GByte/s).
• Backend: each server has 4 SAS drives (4 x 150 MByte/s = 600 MByte/s read/write); the total for six servers is 3600 MByte/s read/write.
• BNL will pay for this Dell cluster.

40 Gbps Data Transfer Testbed for December 2010

[Testbed diagram: two IBM System x3650 M3 servers with RAM disks/SSDs, each attached over InfiniBand to a Mellanox 40G IB switch and equipped with an Acadia NIC; the two sides are linked via Mellanox 40GE/IB. Storage either leverages the DOE ANI tabletop storage or is purchased separately.]

100 Gbps Data Transfer Testbed Proposal

[Testbed diagram: as above, but the local site and a remote site are connected through the DOE 100 Gbits/s backbone or an in-lab fiber plant. Each site has IBM System x3650 M3 servers with RAM disks/SSDs behind a Mellanox 40G IB switch, IB-attached storage, and Acadia NICs facing the backbone. Storage either leverages the DOE ANI tabletop storage or is purchased separately.]

Conclusion

I. For a single data transfer stream with no disk operations, the RDMA transport is twice as fast as TCP while using only about 10% of the CPU load measured under TCP.

II. FTP comprises two components: networking and file operation. Compared with the RDMA operation, file operation (limited by disk performance) accounts for most of the CPU usage. Therefore, a well-designed file buffer model is critical.

Future work

• Set up the Lustre environment and configure Lustre with RDMA support.
• Start the FTP migration to RDMA:
  o Source control
  o Bug database
  o Documentation
  o Unit tests

Some Preliminary Results

Current Environment

• 40 Gbps Mellanox Ethernet link between netqos03 (client) and netqos04 (server); this link supports both RDMA and TCP.

Tool: iperf

• Migrated iperf 2.0.5 to the RDMA environment with OFED (librdmacm and libibverbs).
• 2000+ source lines of code added (from 8382 to 10562).
• iperf usage extended:
  – -H: RDMA transfer mode instead of TCP/UDP
  – -G: pr (passive read) or pw (passive write)
    • passive read: data is read from the server
    • passive write: the server writes into the client
  – -O: output data file, for both the TCP server and the RDMA server
• Only one stream is used per transfer.

Test Suites

• Test suite 1: memory -> memory
• Test suite 2: file -> memory -> memory
  – Test case 2.1: file (regular file) -> memory -> memory
  – Test case 2.2: file (/dev/zero) -> memory -> memory
• Test suite 3: memory -> memory -> file
  – Test case 3.1: memory -> memory -> file (regular file)
  – Test case 3.2: memory -> memory -> file (/dev/null)
• Test suite 4: file -> memory -> memory -> file
  – Test case 4.1: file (regular file) -> memory -> memory -> file (regular file)
  – Test case 4.2: file (/dev/zero) -> memory -> memory -> file (/dev/null)

File Choice

• File operations use the standard I/O library (fread, fwrite), so data is cached by the OS.
• Input from /dev/zero measures the maximum application data transfer rate including the file-read path, with the disk removed as a bottleneck.
• Output to /dev/null measures the maximum application data transfer rate including the file-write path, again with the disk removed as a bottleneck.

Buffer Choice

• The RDMA operation block size is 10 MB (one RDMA READ/WRITE at a time). A previous experiment showed that, in this environment, block sizes above 5 MB have little effect on the transfer speed.
• The TCP read/write buffer size is left at the default.
• The TCP window size is 85.3 KByte (the default).

[Measurement plots (CPU load and bandwidth) for each test case:
  – Test case 1: memory -> memory
  – Test case 2.1 (fread): file (regular file) -> memory -> memory
  – Test case 2.2 (five minutes): file (/dev/zero) -> memory -> memory
  – Test case 3.1 (a 200 GB file is generated): memory -> memory -> file (regular file)
  – Test case 3.2: memory -> memory -> file (/dev/null)
  – Test case 4.1: file (regular file) -> memory -> memory -> file (regular file)
  – Test case 4.2: file (/dev/zero) -> memory -> memory -> file (/dev/null)]
