
Page 1:

Designing Next Generation FS for NVMe and NVMe-oF

Liran Zvibel, CTO & Co-founder, Weka.IO

@liranzvibel
Flash Memory Summit 2017, Santa Clara, CA


Page 3:

Does it work?

• You’re invited to the Intel Partner booth to see it
  • We’ll show the GUI of the WRG

• We’ll talk about creating filesystems that scale to hundreds of petabytes while providing hundreds of millions of IOPS at very low latency


Page 4:

Why are we here?

• Most local file systems are not optimized for NAND flash

• No high-performance clustered FS has been created in a very long time


Page 5:

FS for NVMe and NVMe-oF ?!?

• Right, there are already block-based solutions leveraging NVMe-oF

• The shared FS world has not yet risen to the challenge


Page 6:

Mission Statement

• Architect the best scale-out distributed file system based on NAND flash, optimized for NVMe and high-speed networking

• Run natively in the cloud and scale well beyond rack level


Page 7:

What can be improved

• Random 4K IOPS at low latency

• Metadata performance

• Affordable throughput

• Write amplification

Page 8:

Changes even on a local scale

• Trees are horrible for NAND flash
  • So even though a file system is hierarchical in nature, other means are needed to support that hierarchy

• atime updates mean that reads create writes
  • Horrible from a write-amplification standpoint


Page 9:

Data protection must be flash friendly

• Triple replication is too expensive for flash (and for throughput!)

• Protection should be erasure-coded; large clusters can go 16+2 with very little protection overhead on both flash and network (a quick comparison follows this list)
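To make “very little overhead” concrete, here is a back-of-the-envelope comparison of the extra raw capacity each scheme needs (a minimal sketch; the 16+2 layout is the slide’s example, the 3x replication factor is the usual baseline):

```python
# Raw-capacity overhead: extra flash needed per byte of usable data.

def ec_overhead(data_shards: int, parity_shards: int) -> float:
    """Overhead of an N+K erasure-coded stripe (parity / data)."""
    return parity_shards / data_shards

print(f"3x replication: {2.0:.0%} extra raw flash")       # two extra copies -> 200%
print(f"16+2 coding:    {ec_overhead(16, 2):.1%} extra")  # -> 12.5%
```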


Page 10:

FLASH means more metadata power needed

• Current clustered filesystems are good at providing high throughput over many HDDs

• Flash can deliver far more IOPS and far more metadata operations

• Older filesystems did not try to optimize for 4K workloads, since HDDs could not sustain them


Page 11:

Solving metadata

• The metadata processing associated with distributed FS is massive

• Scaling up is not enough
• The solution must be able to shard metadata into thousands of small pieces (compared to the tens supported today)

• We can effectively break it down into 64K pieces (a hashing sketch follows this list)
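As a rough illustration of what sharding into 64K pieces can look like, here is a minimal sketch that hashes a metadata key to one of 65,536 buckets (the bucket count comes from the slide; the key format and hash choice are invented):

```python
import zlib

NUM_BUCKETS = 64 * 1024  # 64K metadata shards, per the slide

def bucket_for(key: str) -> int:
    """Map a metadata key (e.g. parent inode + child name) to its owning shard."""
    return zlib.crc32(key.encode()) % NUM_BUCKETS

# Operations on the same key always land on the same shard, so that shard can
# serialize them locally instead of taking cluster-wide locks.
print(bucket_for("inode:42/name:build.log"))
```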

Page 12:

Solving metadata – cont

• Latency increases when coordination is required

• Each metadata shard handles the logic of an operation in a lockless manner, requiring minimal help from any other networked component (sketched below)
  • Otherwise, it cannot scale
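One way to picture a lockless shard, sketched under the assumption that each bucket owns its keys and drains its own queue on a single thread (all names here are invented for illustration, not Weka’s code):

```python
from collections import deque

class MetadataShard:
    """Illustrative only: one shard, one queue, one thread, therefore no locks."""

    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.inodes = {}       # shard-local state, touched by a single thread
        self.queue = deque()   # operations routed here by the key hash

    def submit(self, op, key, payload=None):
        self.queue.append((op, key, payload))

    def run_once(self):
        # Drain serially; no other thread ever touches self.inodes, so no locking.
        while self.queue:
            op, key, payload = self.queue.popleft()
            if op == "create":
                self.inodes[key] = payload
            elif op == "lookup":
                print(self.inodes.get(key))

shard = MetadataShard(17)
shard.submit("create", "inode:42/name:build.log", {"size": 0})
shard.submit("lookup", "inode:42/name:build.log")
shard.run_once()
```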


Page 13:

Protocol must change

• NFS over TCP/IP is incompatible with low latency

• NFS is stateless, inefficient, and “chatty”

• Instead: a local, stateful connector with POSIX semantics, leveraging NVMe-oF for data transfer


Page 14:

o Problem: The only way to get fast builds was to run them on local SSD.

o Pain Point: Wasted engineering time. Lots of data copying back and forth from NAS. Multiple copies of the same data taking up space.

o WekaIO Benefit:
  – Matrix completed the build in 38 min vs. 3.8 hours (228 min) on NFS NAS, roughly 6x faster
  – Data never has to be copied; data is fully sharable
  – Massively scalable, sharable, and simplified compared to local disk

o Test platform: 14-node HPE ProLiant cluster vs. Oracle ZFS

o The stateful protocol is key to such improvements

WekaIO is 6X Faster than Oracle ZFS

[Chart: EDA Software Compilation Results, actual measured data at a top-5 semiconductor manufacturer. Android build time in minutes (lower is better) for Local Disk SSD, WekaIO 3.0, and NFS v4.]

Page 15:

User-space is king

• Kernel bypass actually increases performance
• No need to depend on the Linux kernel for networking or the I/O stack
• Our own memory management allows a zero-copy stack
• Our own scheduling reduces context-switch latency (a run-to-completion sketch follows this list)
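A rough sketch of the run-to-completion model such a user-space stack implies (illustrative only; the real stack would poll NIC and NVMe queues through DPDK/SPDK-style interfaces rather than Python, and all function names here are invented):

```python
import time

def poll_nic():
    """Placeholder: a real stack polls NIC RX descriptors directly (no interrupts)."""
    return []  # received packets

def poll_nvme_completions():
    """Placeholder: a real stack polls NVMe completion queues from user space."""
    return []  # completed I/Os

def handle_packet(pkt):
    pass  # parse, then dispatch to filesystem logic

def complete_io(io):
    pass  # hand the data back to the waiting request, zero-copy

def run_to_completion(budget_s: float = 0.001):
    """One core, one loop: no syscalls or context switches on the hot path."""
    deadline = time.perf_counter() + budget_s
    while time.perf_counter() < deadline:
        for pkt in poll_nic():
            handle_packet(pkt)
        for io in poll_nvme_completions():
            complete_io(io)

run_to_completion()
```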


Page 16:

Our networking stack

• We have a native Ethernet network stack that is 100% kernel bypass, supports zero-copy operations, and delivers very low latency

• Routable, with retries, flow control, and congestion control

• We also support NVMe-oF :-)


Page 17:

Locality is irrelevant

• Over 10GbE, moving a 4KB page takes 3.6 µs
• Over 100GbE, moving a 4KB page takes 0.36 µs
• NAND flash media is WAY slower!

• Once locality stops being important, it’s much easier to create distributed algorithms (a quick check of the numbers follows)
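A quick check of those wire-time numbers (raw serialization time only; the slide’s figures presumably include framing and protocol overhead):

```python
# Serialization time for a 4 KiB page at a given link rate (payload only).

def wire_time_us(num_bytes: int, gbit_per_s: float) -> float:
    return num_bytes * 8 / (gbit_per_s * 1e9) * 1e6

page = 4096
print(f"10 GbE : {wire_time_us(page, 10):.2f} us")   # ~3.28 us; ~3.6 us with framing/headers
print(f"100 GbE: {wire_time_us(page, 100):.2f} us")  # ~0.33 us
# NAND flash read latency is typically on the order of 100 us, i.e. far larger.
```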


Page 18:

Software Architecture

§ Runs inside an LXC container for isolation

§ SR-IOV to run the network stack and NVMe in user space

§ Provides a POSIX VFS through lockless queues to the WekaIO driver (a minimal queue sketch follows this list)

§ I/O stack bypasses the kernel
§ Metadata is split into many buckets
  – Buckets quickly migrate → no hot spots

§ Supports bare iron, containers & hypervisors
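To make “lockless queues” concrete, here is a minimal single-producer/single-consumer ring in the spirit of such a queue (illustrative only; the actual driver-to-frontend queues are shared-memory structures, and all names here are invented):

```python
class SPSCRing:
    """Single-producer/single-consumer ring: one writer, one reader, no locks.

    Lock-free because each index is advanced by exactly one side.
    """

    def __init__(self, capacity: int = 1024):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def push(self, item) -> bool:
        nxt = (self.tail + 1) % self.capacity
        if nxt == self.head:        # full
            return False
        self.buf[self.tail] = item
        self.tail = nxt
        return True

    def pop(self):
        if self.head == self.tail:  # empty
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return item

ring = SPSCRing()
ring.push(("read", "/mnt/weka/file", 4096))
print(ring.pop())
```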

[Architecture diagram: applications in user space talk to a Frontend via the Weka driver (kernel) and lockless queues; the SSD Agent and networking stack run in user space over SR-IOV hardware. Distributed FS “Buckets” host clustering, balancing, failure detection & communication, IP takeover, data placement, FS metadata, and tiering to a backend. Protocol front ends: NFS, S3, SMB, HDFS.]

Page 19:

Our per-core NVMe performance

• Random 4K filesystem reads: 50,175 IOPS (less than 500 µs average latency)
  • QD=1 read latency: 188 µs to the application (5,237 IOPS; see the check after this list)

• Random 4K filesystem writes: 11,320 IOPS (less than 700 µs average latency)
  • QD=1 write latency: 150 µs to the application (6,658 IOPS)

• Sequential reads: 980 MB/sec
• Sequential writes: 370 MB/sec
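At queue depth 1, IOPS is just the reciprocal of the per-I/O latency; a quick check against the figures above:

```python
def qd1_iops(latency_us: float) -> float:
    """At queue depth 1 only one I/O is in flight, so IOPS = 1 / latency."""
    return 1e6 / latency_us

print(f"{qd1_iops(188):.0f}")  # ~5319, close to the reported 5,237
print(f"{qd1_iops(150):.0f}")  # ~6667, close to the reported 6,658
# The small gap is presumably per-I/O overhead outside the quoted latency.
```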


Page 20:

Perf Scales Linearly with Cluster Size

[Chart: Linear Scalability, IOPS vs. cluster size (30-240 instances), 100% random read and 100% random write; the IOPS axis runs up to 8M.]

[Chart: Linear Scalability, latency (QD1) vs. cluster size (30-240 instances), read and write latency in milliseconds; the latency axis runs from 0.2 to 0.6 ms.]

[Chart: Linear Scalability, throughput vs. cluster size (30-240 instances), 100% read and 100% write; the throughput axis runs up to 100 GB/sec.]

Test environment: 30-240 r3.8xlarge instances in a single AZ, utilizing 2 cores, 2 local SSD drives, and 10 GB of RAM on each instance (about 5% of its CPU/RAM).

~30K OPS per AWS instance, ~375 MB/sec per AWS instance, <400 microsecond latency (multiplied out below)
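Those per-instance figures roughly multiply out to the chart scales above (a back-of-the-envelope check, assuming the linear scaling the slide claims):

```python
instances = 240
ops_per_instance = 30_000   # ~30K OPS per instance, from the slide
mb_s_per_instance = 375     # ~375 MB/sec per instance, from the slide

print(f"{instances * ops_per_instance / 1e6:.1f}M OPS")      # ~7.2M at 240 instances
print(f"{instances * mb_s_per_instance / 1000:.0f} GB/sec")  # ~90 GB/sec at 240 instances
```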

Page 21:

Hyperconverged Mixed environment

o Ideal for customers who have a mixed environment or who have limited capacity and performance needs

o WekaIO and SSD in every storage-enabled node

[Diagram: a cluster of application servers; some nodes run applications only, while “partial compute” nodes also run the WekaIO file system (MatrixFS with the Matrix client over DPDK) alongside their applications.]

Page 22:

Dedicated Server Model

o Ideal for customers who do not want to run storage services with compute services

o Requires at least 3 storage servers. The more servers, the quicker the rebuild time

[Diagram: compute nodes running applications with the WekaIO front-end client, connected to the WekaIO file system (MatrixFS with the Matrix client over DPDK) running on dedicated storage servers.]

Page 23:

Disaggregated JBOF

o Ideal for customers who do not want to run storage services with compute services

o No need for a large number of JBOFs, can start with one

o Each SSD is its own failure domain

[Diagram: compute nodes running applications with Weka (MatrixFS over DPDK), connected over 100GE to JBOF enclosures (servers with NICs and SSDs) via NVMe-oF or the WekaIO protocol.]

Page 24:

What is NVMe-oF?

• Couples NVMe devices with a networked fabric

• Can be supported in software or hardware-accelerated
• Several fabrics are supported: RDMA-based (IB, RoCE, iWARP), FC, etc.
• We care about the RDMA-based ones


Page 25:

Zoom into NVMe-oF w/ RDMA

• IB is the best kind of RDMA; it just works!

• Ethernet has RoCE support, but it requires PFC to be configured
  • That means it works great inside a TOR, but is very difficult to configure across the data center, or even across several racks


Page 26:

But you promised datacenter scale!

• The trick is to use NVMe-oF inside an RDMA domain (or “rack scale”), and the Weka network protocol between racks


Page 27:

Multi-rack Architecture

Page 28:

NVMe-oF increasing reliability

• In hyperconverged mode, if the NVMe device talks directly to the NIC, it survives kernel panics and server failures as long as there is power

• Requires HW-accelerated NICs that transparently convert a DAS NVMe SSD into an NVMe-oF-connected device


Page 29:

NVMe-oF reducing overhead

• HA appliances are coming soon with a SoC instead of an Intel processor, reducing overall solution price considerably

