Challenges in Using Persistent Memory In Distributed Storage Systems

Dan Lambright
Storage System Software Developer
Adjunct Professor, University of Massachusetts Lowell
Aug. 23, 2016

RED HAT

Overview

● Storage class memory (SCM)
● Distributed storage
● GlusterFS, Ceph
● Network latency
● Accelerating parts of the system with SCM
● CPU latency

Storage Class Memory
What do we know / expect?

● Near DRAM speeds
● Better wearability than SSDs
● Byte or block addressable (via driver)
● Likely to be expensive
● Fast random access
● Accessible via API (crash-proof transactions)
● Bottlenecks move elsewhere within system
● Support in Linux

The problem
Must lower latencies throughout system: storage, network, CPU

Media   Latency   Disadvantages
HDD     10 ms     Slow
SSD     1 ms      Wears out
SCM     < 1 us    Cost
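A rough way to see why the bottleneck moves is to compare each media latency in the table with a network round trip. The sketch below is illustrative only: the 100 us round trip is an assumed figure, not a measurement from this talk, and SCM is taken at its 1 us upper bound.

```python
# Media latencies from the table above, in microseconds.
MEDIA_LATENCY_US = {"HDD": 10_000.0, "SSD": 1_000.0, "SCM": 1.0}

# Assumption for illustration only; not a number from the talk.
ASSUMED_NETWORK_RTT_US = 100.0


def network_share(media_us, rtt_us=ASSUMED_NETWORK_RTT_US):
    """Fraction of one remote I/O spent on the network round trip."""
    return rtt_us / (rtt_us + media_us)


for media, latency_us in MEDIA_LATENCY_US.items():
    share = network_share(latency_us)
    print(f"{media}: network is {share:.1%} of the request")
```

Under that assumption the network is a rounding error for HDD and nearly the whole request for SCM, which is the motivation for the network and CPU sections that follow.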

Distributed Storage
Why use it?

● Single server (NFS) scales poorly
● Benefits of distributed storage
  ○ “scale out” to 1000s of nodes
  ○ Single namespace
  ○ Minimal impact on node failure
  ○ Good fit for commodity hardware

Case Studies
GlusterFS

● Primarily used as a file store
● Combines multiple file systems into a single namespace

Case Studies
Ceph

● Popular in OpenStack
● Block, object, file
● RADOS as intermediate representation

Framing The Problem
What to analyze

● Plethora of workloads and configurations
  ○ HPC, sequential, random, mixed read/write/transfer size, etc.
  ○ # OSDs, nodes, replica/EC sets, ...
● Benchmark one
  ○ (e.g. OSD/core)
  ○ Storage is memory /dev/pmem
  ○ Single workload 4K RW - larger transfers see better benefit with RDMA

NETWORK LATENCY

Replication
Latency to copy across nodes

● “Primary copy”: the primary updates replicas in parallel and processes both reads and writes
  ○ Gluster’s forthcoming “JBR”
● “Chain”: writes are forwarded sequentially along the chain; reads are served at the tail
  ○ (tail sends ACK to client, so fewer messages, more latency)
● Ceph uses “splay” replication, combining parallel updates with reads at tail
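The trade-off between the two schemes can be sketched with a toy hop-count model. Everything here is an assumption made for illustration: every hop costs the same one-way latency, processing time is ignored, and the function names are hypothetical. It is not a model of Gluster's or Ceph's actual protocols.

```python
def primary_copy_write(hop_us, replicas):
    """Client -> primary, primary <-> backups in parallel, primary -> client.

    Returns (latency_us, messages) for one write.
    """
    backups = replicas - 1
    hops = 2 + (2 if backups else 0)   # backup round trip happens in parallel
    messages = 2 + 2 * backups         # request + ack per backup, plus client legs
    return hops * hop_us, messages


def chain_write(hop_us, replicas):
    """Client -> head, forwarded node to node, tail -> client.

    Returns (latency_us, messages) for one write.
    """
    hops = 1 + (replicas - 1) + 1      # every hop is sequential
    return hops * hop_us, hops         # one message per hop


for n in (2, 3, 5):
    print(n, "primary-copy:", primary_copy_write(50.0, n), "chain:", chain_write(50.0, n))
```

In this model chain replication always sends fewer messages, while its latency grows with the replica count and primary-copy latency stays flat. The point at which chain becomes slower (above three replicas here) is an artifact of the simplifications.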

Client vs Server-side Replication

Client fan-out uses more client-side bandwidth, and the client likely has a slower network than the server.

Server-side replication requires an extra hop, which adds latency.

Consistency in Ceph

● Reads following writes
  ○ Only return most recently committed data
  ○ May see bottleneck at tail
● Writes to different objects (but same PG) are serialized
  ○ PG size configurable online
  ○ But, each PG uses resources

Improving Network Latency
Techniques

● 2X replication
  ○ If MTBF for SCM is better than rotational storage
● Coalescing operations
  ○ Observed 10% improvement in small file creates on Gluster
● Pipelining
● Better hardware
  ○ RDMA (helps larger transfers)
  ○ Increase MTU
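Coalescing can be pictured as a client that queues small operations and ships them in one message. This is a minimal sketch under stated assumptions: the `CoalescingClient` class, its `send` callback and the batch size of 8 are hypothetical and do not correspond to any Gluster or Ceph API.

```python
class CoalescingClient:
    """Queue small operations and send them to the server in batches."""

    def __init__(self, send, max_batch=8):
        self._send = send            # callable taking a list of operations
        self._max_batch = max_batch  # assumed batch size, for illustration
        self._pending = []

    def submit(self, op):
        self._pending.append(op)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self):
        """Send whatever is queued as a single message (one round trip)."""
        if self._pending:
            self._send(list(self._pending))
            self._pending.clear()


messages = []
client = CoalescingClient(messages.append, max_batch=8)
for i in range(20):
    client.submit(("create", f"file{i}"))
client.flush()
print(f"20 operations sent in {len(messages)} messages")
```

The saving comes from paying the round-trip cost once per batch rather than once per operation. A real implementation would also need a timer so that a partly filled batch does not wait indefinitely.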

ACCELERATION

Improve Parts of System With SCM
Heterogeneous storage

● Tiering
● Ceph Filestore journal
● Ceph Bluestore Write ahead log
● DM cache
● XFS journal

In Depth: Gluster Tiering
Illustration of network problem

● Heterogeneous storage in a single volume
● Fast/expensive storage caches slower storage
● Introduced in Gluster 3.7
● Fast “Hot tier” (e.g. SSD, SCM)
● Slow “Cold tier” (e.g. erasure coded)
● Cache policies:
  ○ All data placed on hot tier, until “full”
  ○ Once “full”, data “promoted/demoted” based on access frequency
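The two cache policies can be sketched as a toy model. The `TieredVolume` class and its methods are hypothetical names, and ranking purely by hit count with a fixed file-count capacity is an assumption. Gluster's real tiering logic is not reproduced here.

```python
from collections import Counter


class TieredVolume:
    """Toy hot/cold placement: fill the hot tier first, then rank by access count."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity  # assumed: capacity counted in files
        self.hot = set()
        self.cold = set()
        self.hits = Counter()

    def create(self, name):
        # Policy 1: all new data lands on the hot tier until it is "full".
        target = self.hot if len(self.hot) < self.hot_capacity else self.cold
        target.add(name)

    def access(self, name):
        self.hits[name] += 1

    def rebalance(self):
        # Policy 2: keep the most frequently accessed files on the hot tier.
        everything = self.hot | self.cold
        ranked = sorted(everything, key=lambda n: (-self.hits[n], n))
        self.hot = set(ranked[: self.hot_capacity])
        self.cold = everything - self.hot


vol = TieredVolume(hot_capacity=2)
for name in ("a", "b", "c"):
    vol.create(name)
for _ in range(3):
    vol.access("c")
vol.access("a")
vol.rebalance()
print("hot:", sorted(vol.hot), "cold:", sorted(vol.cold))
```

Here `c` is promoted and `b` is demoted because `b` was never accessed after creation.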

Gluster’s “Small File” Problem
Analysis

● Tier helped large I/Os, not small
● Pattern seen elsewhere ...
  ○ RDMA performance tests
  ○ Customer feedback, overall GlusterFS reputation ...
● Profiles show many LOOKUP round trips
● Conclusion: LOOKUP RTT dominates faster data transfers
  ○ The problem is exacerbated with SCM

Understanding LOOKUPs in Gluster
Problem: Path Traversal

● Each directory in path is tested on open
  ○ Existence tests
  ○ Permission tests

[Figure: path d1/d2/d3/f1 and servers s1, s2, s3, s4]

Understanding LOOKUPs in Gluster
Problem: Coalescing Distributed Hash Ranges

● Client side replication and distributed hash computation
● “Layout” definition
● Layouts are split across nodes
● Each node checked on every LOOKUP to get the full picture
● Must confirm each file up to date
  ○ File moved
  ○ Node membership changes
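The need to contact every node can be illustrated with a small model in which each node holds only its own slice of the hash space. The client must gather all slices before it can place a file. The function names, the 32-bit hash space and the use of CRC32 are assumptions made for this sketch. Gluster's own hash function and layout format are not reproduced.

```python
import zlib

HASH_SPACE = 2**32  # assumed 32-bit hash space


def gather_layout(node_ranges):
    """Merge per-node (start, end) ranges and check they tile the hash space."""
    layout = sorted((start, end, node) for node, (start, end) in node_ranges.items())
    expected = 0
    for start, end, _node in layout:
        if start != expected:
            raise ValueError(f"hole or overlap at {expected:#x}")
        expected = end + 1
    if expected != HASH_SPACE:
        raise ValueError("layout does not cover the hash space")
    return layout


def locate(layout, filename):
    """Return the node whose range contains the hash of the file name."""
    h = zlib.crc32(filename.encode())  # stand-in hash, not Gluster's
    for start, end, node in layout:
        if start <= h <= end:
            return node
    raise LookupError(filename)


quarter = HASH_SPACE // 4
node_ranges = {f"s{i + 1}": (i * quarter, (i + 1) * quarter - 1) for i in range(4)}
layout = gather_layout(node_ranges)
print("f1 lives on", locate(layout, "f1"))
```

If any one node's range is missing, `gather_layout` cannot tell whether the layout is complete. This is why a membership change or a moved file forces the client to re-check.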

LOOKUP Amplification

[Figure: path d1/d2/d3/f1 and servers s1, s2, s3, s4]

d1/d2/d3: three LOOKUPs
Four servers
12 LOOKUPs total in the worst case
For a single I/O
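The arithmetic generalizes directly: one LOOKUP per directory component, repeated on every server. The helper below is a hypothetical name. Following the slide, it counts only the directories in the path, not the file itself.

```python
def worst_case_lookups(path, servers):
    """Directory components in the path times the number of servers."""
    components = [part for part in path.split("/") if part]
    directories = components[:-1]  # last component is the file
    return len(directories) * servers


print(worst_case_lookups("d1/d2/d3/f1", servers=4))
```

Because each LOOKUP is a network round trip, the cost grows with both path depth and cluster size, regardless of how fast the storage media is.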

Client Metadata Cache
Gluster’s md-cache translator

● Cache Gluster’s per-file metadata at client
● Enhancements under development to cache longer
● Invalidate cache entry on another client’s change
  ○ Change to layout
● Invalidate intelligently, not spuriously
  ○ Some attributes may change a lot (ctime, atime, ..)
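The caching and invalidation ideas can be sketched as a small client-side cache. This is not md-cache's implementation: the `MetadataCache` class, the 60-second lifetime and the choice to ignore `atime`/`ctime`-only changes are assumptions made for illustration.

```python
import time

# Assumption: changes touching only these attributes are treated as noise.
NOISY_ATTRS = frozenset({"atime", "ctime"})


class MetadataCache:
    """Cache per-file metadata at the client, with server-driven invalidation."""

    def __init__(self, fetch, ttl_s=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable doing the real (remote) lookup
        self._ttl_s = ttl_s      # assumed cache lifetime
        self._clock = clock
        self._entries = {}       # path -> (time cached, metadata)

    def stat(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]      # served locally, no round trip
        meta = self._fetch(path)
        self._entries[path] = (now, meta)
        return meta

    def invalidate(self, path, changed_attrs):
        """Called when another client changes the file."""
        if set(changed_attrs) - NOISY_ATTRS:
            self._entries.pop(path, None)


round_trips = []
cache = MetadataCache(lambda p: round_trips.append(p) or {"path": p})
cache.stat("/d1/d2/d3/f1")
cache.stat("/d1/d2/d3/f1")
cache.invalidate("/d1/d2/d3/f1", ["atime"])
cache.stat("/d1/d2/d3/f1")
print("round trips:", len(round_trips))
```

Only the first `stat` reaches the server in this run. A longer lifetime is safe only if invalidations are delivered reliably, which is the dependency the bullets above point at.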

CPU LATENCY

CPU problem
Services needed to distribute storage add to CPU overhead

● Data distribution over nodes
● Replication + EC over nodes
● Single namespace management
● Conversion between external and internal representation

In Depth: Ceph Datapath
Analysis

● Upper (fast) and lower (slow) halves of I/O path
● Context switch between halves
● Memory allocation matters (jemalloc)

Datapath

Community Contributions
SanDisk, CohortFS, others

● SanDisk
  ○ Sharded work queues
  ○ Bluestore optimizations (shrink metadata, tuning RocksDB)
  ○ Identified TCMalloc problems, introduced JEMalloc
  ○ .. more.. ongoing
● CohortFS (now Red Hat)
  ○ Accelio RDMA module
  ○ Divide and Conquer performance analysis using memstore
  ○ Lockless algorithms / RCU (coming soon)

Bluestore
Key-value database as store

● Motivation
  ○ Transactions difficult to implement with POSIX
  ○ Ceph journal necessitated double writes
  ○ Object enumeration inefficient
● Why a database?
  ○ ACID semantics for transactions
  ○ Efficient storage allocation (formerly managed by fs)

Bluestore

● Shorter code path helps latency
● No longer traverse XFS file system
● RocksDB used

Some results
Code in flux - YMMV!

Bluestore
Hardening performance

● Near term improvements
  ○ Sharded extents, not all in one omap (so 4K random reads won't incur large metadata writes)
  ○ Tune RocksDB compaction options
● Seek alternative to RocksDB?
  ○ LSM style optimizes for sequential access
  ○ Incurs periodic background compaction, write amplification, ...
  ○ Instead, try SanDisk’s ZetaScale?

SUMMARY

Summary

● Distributed storage poses unique problems with latency.
● Network
  ○ Reduce round trips by streamlining, coalescing protocol, etc.
  ○ Cache at client
● CPU
  ○ Keep shrinking the stack
  ○ Run to completion
● Consider
  ○ SCM as a tier/cache
  ○ 2x replication

THANK YOU

plus.google.com/+RedHat

linkedin.com/company/red-hat

youtube.com/user/RedHatVideos

facebook.com/redhatinc

twitter.com/RedHatNews