ERASURE CODING AND CACHE TIERING

SAGE WEIL - SDC 2014.09.16


ARCHITECTURE

CEPH MOTIVATING PRINCIPLES

● All components must scale horizontally

● There can be no single point of failure

● The solution must be hardware agnostic

● Should use commodity hardware

● Self-manage whenever possible

● Open source

ARCHITECTURAL COMPONENTS

● RGW: A web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: A reliable, fully-distributed block device with cloud platform integration
● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

[Diagram row: APP / HOST/VM / CLIENT, the consumer of each interface]
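As a quick, illustrative sketch of what "apps directly access RADOS" looks like with the Python LIBRADOS bindings (not from the deck; the config path, pool, and object names are assumptions):

    import rados

    # Connect using the standard config file path (an assumption).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on a hypothetical pool and do direct object I/O.
    ioctx = cluster.open_ioctx('mypool')
    ioctx.write_full('greeting', b'hello world')   # store an object
    print(ioctx.read('greeting'))                  # read it back
    ioctx.close()
    cluster.shutdown()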

ROBUST SERVICES BUILT ON RADOS

THE RADOS GATEWAY

[Diagram: applications speak REST to RADOSGW instances; each RADOSGW links LIBRADOS and talks over a socket to the RADOS cluster of monitors (M) and OSDs]

MULTI-SITE OBJECT STORAGE

[Diagram: two sites, US-EAST and EU-WEST; at each site a web application talks to an app server, which uses the Ceph Object Gateway (RGW) in front of its own Ceph storage cluster]

RADOSGW MAKES RADOS WEBBY

● REST-based object storage proxy
● Uses RADOS to store objects
– Stripes large RESTful objects across many RADOS objects
● API supports buckets and accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
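For example (illustrative, not part of the deck), a stock S3 application can point boto at an RGW endpoint; the host and credentials below are hypothetical:

    import boto
    import boto.s3.connection

    # Connect to a hypothetical RGW endpoint with S3-style credentials.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    bucket = conn.create_bucket('my-bucket')      # an RGW bucket
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello world')   # striped across RADOS objects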


STORING VIRTUAL DISKS

[Diagram: a VM runs on a hypervisor that links LIBRBD, which stores the virtual disk as objects in the RADOS cluster]

KERNEL MODULE

[Diagram: a Linux host uses the in-kernel KRBD driver to access the RADOS cluster directly]

RBD FEATURES

● Stripe images across entire cluster (pool)

● Read-only snapshots

● Copy-on-write clones

● Broad integration

– Qemu

– Linux kernel

– iSCSI (STGT, LIO)

– OpenStack, CloudStack, Nebula, Ganeti, Proxmox

● Incremental backup (relative to snapshots)
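A sketch of the snapshot and clone workflow with the Python rbd bindings (illustrative; image and snapshot names are hypothetical, and signatures are as I recall them from this era):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # Create a 1 GiB format-2 image (layering enables copy-on-write clones).
    rbd.RBD().create(ioctx, 'base', 1 << 30, old_format=False,
                     features=rbd.RBD_FEATURE_LAYERING)

    # Take a read-only snapshot and protect it so it can be cloned.
    image = rbd.Image(ioctx, 'base')
    image.create_snap('gold')
    image.protect_snap('gold')
    image.close()

    # Copy-on-write clone backed by the protected snapshot.
    rbd.RBD().clone(ioctx, 'base', 'gold', ioctx, 'child',
                    features=rbd.RBD_FEATURE_LAYERING)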


SEPARATE METADATA SERVER

[Diagram: a Linux host's kernel client sends metadata operations to the metadata server while reading and writing file data directly to the RADOS cluster]

SCALABLE METADATA SERVERS

METADATA SERVER (MDS)
● Manages metadata for a POSIX-compliant shared filesystem
– Directory hierarchy
– File metadata (owner, timestamps, mode, etc.)
● Clients stripe file data in RADOS
– MDS not in data path
● MDS stores metadata in RADOS
– Key/value objects
● Dynamic MDS cluster scales to 10s or 100s
● Only required for shared filesystem


RADOS


● Flat object namespace within each pool

● Rich object API (librados)

– Bytes, attributes, key/value data

– Partial overwrite of existing data

– Single-object compound operations

– RADOS classes (stored procedures)

● Strong consistency (CP system)

● Infrastructure aware, dynamic topology

● Hash-based placement (CRUSH)

● Direct client to server data path
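To ground the API bullets above (an illustrative sketch; the pool name is hypothetical), a single object can carry bytes, attributes, and key/value data, with the key/value update applied through one compound write operation:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('mypool')

    # Bytes, including partial overwrite of existing data.
    ioctx.write_full('obj', b'hello world')
    ioctx.write('obj', b'WORLD', 6)          # overwrite bytes 6..10 in place

    # Attributes.
    ioctx.set_xattr('obj', 'owner', b'sage')

    # Key/value data via a single-object compound write operation.
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, ('color',), (b'blue',))
        ioctx.operate_write_op(op, 'obj')    # applied atomically to 'obj'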

RADOS CLUSTER

[Diagram: an application talks to a RADOS cluster of many OSDs, a few of which also host monitors (M)]

RADOS COMPONENTS

OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors (M):
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of data path

OBJECT STORAGE DAEMONS

[Diagram: each OSD daemon sits atop a local filesystem (xfs, btrfs, or ext4) on its own disk; monitors (M) run alongside]


DATA PLACEMENT

WHERE DO OBJECTS LIVE?

[Diagram: an application holds an OBJECT and must locate it somewhere among the OSDs and monitors]

A METADATA SERVER?

[Diagram: the application (1) consults a central metadata server for the object's location, then (2) accesses the object]

CALCULATED PLACEMENT

[Diagram: the application computes placement with a function F; servers are statically partitioned by name ranges A-G, H-N, O-T, U-Z]

CRUSH

[Diagram: objects are hashed into PLACEMENT GROUPS (PGs), and CRUSH maps each PG onto a set of OSDs in the cluster]

CRUSH IS A QUICK CALCULATION

[Diagram: a client computes an object's PG and that PG's OSDs directly; no lookup against the cluster is needed]

CRUSH AVOIDS FAILED DEVICES

[Diagram: when an OSD fails, CRUSH maps the object's PG to a different, healthy OSD]

CRUSH: DECLUSTERED PLACEMENT

[Diagram: PGs spread pseudorandomly across the RADOS cluster]

● Each PG independently maps to a pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD
– Highly parallel recovery

CRUSH: DYNAMIC DATA PLACEMENT

CRUSH: pseudo-random placement algorithm
● Fast calculation, no lookup
● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
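To make the idea concrete (an illustrative toy, emphatically not the real CRUSH algorithm), hash-based placement can be sketched like this:

    import hashlib

    def stable_hash(s: str) -> int:
        # Deterministic hash so every client computes the same answer.
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def object_to_pg(obj_name: str, pg_num: int) -> int:
        # Objects hash into a fixed number of placement groups.
        return stable_hash(obj_name) % pg_num

    def pg_to_osds(pg: int, osds: list, replicas: int = 3) -> list:
        # Toy stand-in for CRUSH: a repeatable, pseudorandom OSD set per PG.
        # Real CRUSH also honors topology, weights, and failure domains,
        # and minimizes data movement when the cluster map changes.
        ranked = sorted(osds, key=lambda osd: stable_hash(f'{pg}:{osd}'))
        return ranked[:replicas]

    osds = [f'osd.{i}' for i in range(10)]
    pg = object_to_pg('hello', pg_num=128)
    print(pg, pg_to_osds(pg, osds))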

DATA IS ORGANIZED INTO POOLS

[Diagram: objects hash into PGs within each of several pools (POOL A, POOL B, POOL C, POOL D); each pool's PGs map onto the cluster independently]


TIERED STORAGE

TWO WAYS TO CACHE

● Within each OSD

– Combine SSD and HDD for each OSD

– Make localized promote/demote decisions

– Leverage existing tools

● dm-cache, bcache, FlashCache
● Variety of caching controllers

– We can help with hints

● Cache on separate devices/nodes

– Different hardware for different tiers

● Slow nodes for cold data
● High performance nodes for hot data

– Add, remove, scale each tier independently

● Unlikely to choose right ratios at procurement time

TIERED STORAGE

[Diagram: an application writes to a replicated CACHE POOL layered over an erasure-coded BACKING POOL, all within one Ceph storage cluster]


RADOS TIERING PRINCIPLES

● Each tier is a RADOS pool

– May be replicated or erasure coded

● Tiers are durable

– e.g., replicate across SSDs in multiple hosts

● Each tier has its own CRUSH policy

– e.g., map to SSD devices/hosts only

● librados clients adapt to tiering topology

– Transparently direct requests accordingly

● e.g., to cache

– No changes to RBD, RGW, CephFS, etc.
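For orientation (illustrative; the pool names are hypothetical), wiring a cache tier over a base pool uses the ceph CLI, sketched here from Python:

    import subprocess

    def ceph(*args):
        # Thin wrapper over the ceph CLI for readability.
        subprocess.check_call(['ceph'] + list(args))

    # Attach 'cache' as a writeback tier over the base pool 'cold',
    # then route client traffic through the tier transparently.
    ceph('osd', 'tier', 'add', 'cold', 'cache')
    ceph('osd', 'tier', 'cache-mode', 'cache', 'writeback')
    ceph('osd', 'tier', 'set-overlay', 'cold', 'cache')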

WRITE INTO CACHE POOL

[Diagram: the client WRITEs to the writeback SSD cache pool and receives an ACK from it; the HDD backing pool is not touched]

WRITE INTO CACHE POOL

[Diagram: if the object is not yet in the cache, it is first PROMOTEd from the backing pool, then the WRITE is applied and ACKed]

READ (CACHE HIT)

[Diagram: the client READs from the cache pool and gets a READ REPLY directly]

READ (CACHE MISS)

[Diagram: on a miss, the cache pool can REDIRECT the READ to the backing pool, which serves the READ REPLY]

READ (CACHE MISS)

[Diagram: alternatively, the object is PROMOTEd from the backing pool into the cache, which then serves the READ REPLY]

ESTIMATING TEMPERATURE

● Each PG constructs in-memory bloom filters

– Insert records on both read and write

– Each filter covers configurable period (e.g., 1 hour)

– Tunable false positive probability (e.g., 5%)

– Maintain most recent N filters on disk

● Estimate temperature

– Has object been accessed in any of the last N periods?

– ...in how many of them?

– Informs flush/evict decision

● Estimate “recency”

– How many periods has it been since the object was last accessed?

– Informs read miss behavior: promote vs redirect
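In Ceph these filters are the cache pool's "hit sets"; the sketch below is a toy reconstruction of the idea (illustrative parameters, not Ceph code):

    import hashlib

    class BloomFilter:
        # Minimal bloom filter: k hash probes into a bit array.
        def __init__(self, nbits=8192, nhashes=4):
            self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

        def _probes(self, key):
            for i in range(self.nhashes):
                h = hashlib.md5(f'{i}:{key}'.encode()).hexdigest()
                yield int(h, 16) % self.nbits

        def insert(self, key):
            for p in self._probes(key):
                self.bits |= 1 << p

        def __contains__(self, key):
            return all(self.bits >> p & 1 for p in self._probes(key))

    class TemperatureTracker:
        # Keep the most recent N period filters, newest first.
        def __init__(self, nfilters=8):
            self.filters = [BloomFilter() for _ in range(nfilters)]

        def record_access(self, obj):       # insert on both read and write
            self.filters[0].insert(obj)

        def rotate(self):                   # at each period boundary
            self.filters = [BloomFilter()] + self.filters[:-1]

        def temperature(self, obj):
            # In how many recent periods was the object (probably) seen?
            return sum(obj in f for f in self.filters)

        def recency(self, obj):
            # Periods since the most recent (probable) access, or None.
            for i, f in enumerate(self.filters):
                if obj in f:
                    return i
            return None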

AGENT: FLUSH COLD DATA

[Diagram: the tiering agent FLUSHes a cold object from the cache pool to the backing pool and gets an ACK; the client is not involved]

TIERING AGENT

● Each PG has an internal tiering agent

– Manages PG based on administrator defined policy

● Flush dirty objects

– When pool reaches target dirty ratio

– Tries to select cold objects

– Marks objects clean when they have been written back to the base pool

● Evict clean objects

– Greater “effort” as pool/PG size approaches target size
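A toy sketch of that policy loop (illustrative thresholds and data model, not the actual agent):

    from dataclasses import dataclass, field

    @dataclass
    class CachedObject:
        name: str
        temperature: int   # e.g., hit-set count from the sketch above
        dirty: bool

    @dataclass
    class CachePG:
        capacity: int      # target object count for this PG
        objects: list = field(default_factory=list)

        def dirty_ratio(self):
            return sum(o.dirty for o in self.objects) / self.capacity

        def full_ratio(self):
            return len(self.objects) / self.capacity

    def agent_tick(pg, target_dirty=0.4, target_full=0.8):
        coldest_first = sorted(pg.objects, key=lambda o: o.temperature)
        # Flush: write cold dirty objects back to the base pool, mark clean.
        for obj in [o for o in coldest_first if o.dirty]:
            if pg.dirty_ratio() <= target_dirty:
                break
            obj.dirty = False              # stands in for writeback
        # Evict: drop cold clean objects as the PG nears its target size.
        for obj in [o for o in coldest_first if not o.dirty]:
            if pg.full_ratio() <= target_full:
                break
            pg.objects.remove(obj)

    pg = CachePG(capacity=4, objects=[
        CachedObject('a', temperature=0, dirty=True),
        CachedObject('b', temperature=5, dirty=True),
        CachedObject('c', temperature=1, dirty=False),
        CachedObject('d', temperature=7, dirty=False),
    ])
    agent_tick(pg)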

READ ONLY CACHE TIER

[Diagram: the client WRITEs directly to the replicated backing pool and is ACKed there; READs hit the read-only cache pool, which PROMOTEs objects as needed and serves READ REPLYs]


ERASURE CODING

ERASURE CODING

[Diagram: the same object stored in a replicated pool (three full COPIES) vs. an erasure coded pool (data shards 1-4 plus coding shards X and Y)]

Replicated pool:
● Full copies of stored objects
● Very high durability
● 3x (200% overhead)
● Quicker recovery

Erasure coded pool:
● One copy plus parity
● Cost-effective durability
● 1.5x (50% overhead)
● Expensive recovery
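As a minimal illustration of the parity idea (a toy k=2, m=1 XOR code, far simpler than the Reed-Solomon-style codes Ceph actually uses):

    # Toy erasure code: k=2 data shards, m=1 parity shard (XOR).
    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    d1, d2 = b'HELL', b'O!!!'
    parity = xor(d1, d2)        # survives the loss of any one shard

    # Lose d1: recover it by XORing the survivors.
    recovered = xor(d2, parity)
    assert recovered == d1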

ERASURE CODING SHARDS

[Diagram: an OBJECT is split into data shards on OSDs 1-4 plus coding shards on OSDs X and Y, all within one erasure coded pool]

ERASURE CODING SHARDS

Stripe units are laid out across the shards (one row per OSD, one column per stripe; primed letters are the coding units):

OSD 1: 0   4   8   12  16
OSD 2: 1   5   9   13  17
OSD 3: 2   6   10  14  18
OSD 4: 3   7   11  15  19
OSD X: A   B   C   D   E
OSD Y: A'  B'  C'  D'  E'

● Variable stripe size
● Zero-fill shards (logically) in partial tail stripe

PRIMARY

[Diagram: one shard's OSD acts as the primary for the PG and coordinates client I/O for the erasure coded pool]

EC READ

[Diagram, three steps: (1) the client sends a READ to the PG's primary OSD; (2) the primary issues sub-READS to the shard OSDs; (3) the primary reconstructs the object and returns the READ REPLY]

EC WRITE

[Diagram, three steps: (1) the client sends a WRITE to the PG's primary OSD; (2) the primary encodes the update and issues sub-WRITES to the shard OSDs; (3) the primary ACKs the WRITE to the client]

EC WRITE: DEGRADED

[Diagram: with one shard OSD down, the primary still sends sub-WRITES to the remaining shards; the write completes with the PG degraded]

EC WRITE: PARTIAL FAILURE

[Diagram, two steps: (1) a failure interrupts the sub-WRITES mid-flight; (2) some shards hold the new version B while others still hold the old version A, leaving the stripe inconsistent]


EC RESTRICTIONS

● Overwrite in place will not work in general

● Log and 2PC would increase complexity, latency

● We chose to restrict allowed operations

– create

– append (on stripe boundary)

– remove (keep previous generation of object for some time)

● These operations can all easily be rolled back locally

– create → delete

– append → truncate

– remove → roll back to previous generation

● Object attrs preserved in existing PG logs (they are small)

● Key/value data is not allowed on EC pools
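A sketch of why those three operations are locally reversible (a toy shard model, not Ceph code):

    # Each shard logs just enough state to undo an operation locally.
    def rollback(shard, op):
        if op['type'] == 'create':
            shard['data'] = None                           # create -> delete
        elif op['type'] == 'append':
            shard['data'] = shard['data'][:op['old_len']]  # append -> truncate
        elif op['type'] == 'remove':
            shard['data'] = op['old_generation']           # restore generation

    shard = {'data': b'ABCD'}
    op = {'type': 'append', 'old_len': 4}
    shard['data'] += b'EF'       # a partially-applied append on this shard
    rollback(shard, op)
    assert shard['data'] == b'ABCD'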

EC WRITE: PARTIAL FAILURE

[Diagram, two steps: (1) the stripe is left inconsistent, with some shards at version B and others at version A; (2) the shards roll the partial write back, returning every shard to version A]


EC RESTRICTIONS

● This is a small subset of allowed librados operations

– Notably cannot (over)write any extent

● Coincidentally, these operations are also inefficient for erasure codes

– Generally require read/modify/write of affected stripe(s)

● Some applications can consume EC directly

– RGW (no object data update in place)

● Others can combine EC with a cache tier (RBD, CephFS)

– Replication for warm/hot data

– Erasure coding for cold data

– Tiering agent skips objects with key/value data


WHICH ERASURE CODE?

● The EC algorithm and implementation are pluggable

– jerasure (free, open, and very fast)

– ISA-L (Intel library; optimized for modern Intel procs)

– LRC (local recovery code – layers over existing plugins)

● Parameterized

– Pick k and m, stripe size

● OSD handles data path, placement, rollback, etc.

● Plugin handles

– Encode and decode

– Given these available shards, which ones should I fetch to satisfy a read?

– Given these available shards and these missing shards, which ones should I fetch to recover?
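For example (illustrative; the profile and pool names are hypothetical), selecting a plugin and its parameters when creating an EC pool:

    import subprocess

    def ceph(*args):
        subprocess.check_call(['ceph'] + list(args))

    # Define a profile: jerasure plugin, k=4 data shards, m=2 coding shards.
    ceph('osd', 'erasure-code-profile', 'set', 'myprofile',
         'plugin=jerasure', 'k=4', 'm=2')

    # Create an erasure coded pool that uses the profile.
    ceph('osd', 'pool', 'create', 'ecpool', '128', '128',
         'erasure', 'myprofile')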

COST OF RECOVERY

[Diagram sequence: a 1 TB OSD fails.
– Replication: the lost 1 TB is re-replicated from surviving copies, either 1 TB from a single replica or spread across the cluster as many ~.01 TB parallel recoveries.
– Erasure coding: rebuilding the lost 1 TB requires reading ~1 TB from each of several (k) surviving shards, so several TB must be read to recover one.]

LOCAL RECOVERY CODE (LRC)

[Diagram: alongside data shards on OSDs 1-4 and coding shards on OSDs X and Y, LRC adds local parity shards on OSDs A, B, and C, so a single lost shard can be rebuilt from a small local group instead of reading k shards]

BIG THANKS TO

● Ceph

– Loic Dachary (CloudWatt, FSF France, Red Hat)

– Andreas Peters (CERN)

– Sam Just (Inktank / Red Hat)

– David Zafman (Inktank / Red Hat)

● jerasure

– Jim Plank (University of Tennessee)

– Kevin Greenan (Box)


ROADMAP

WHAT'S NEXT

● Erasure coding

– Allow (optimistic) client reads directly from shards

– ARM optimizations for jerasure

● Cache pools

– Better agent decisions (when to flush or evict)

– Supporting different performance profiles

● e.g., slow / “cheap” flash can still serve reads just as fast

– Complex topologies

● Multiple readonly cache tiers in multiple sites

● Tiering

– Support “redirects” to cold tier below base pool

– Dynamic spin-down

OTHER ONGOING WORK

● Performance optimization (SanDisk, Mellanox)

● Alternative OSD backends

– leveldb, rocksdb, LMDB

– hybrid key/value and file system

● Messenger (network layer) improvements

– RDMA support (libxio – Mellanox)

– Event-driven TCP implementation (UnitedStack)

● Multi-datacenter RADOS replication

● CephFS

– Online consistency checking

– Performance, robustness


THANK YOU!

Sage Weil
CEPH PRINCIPAL ARCHITECT

[email protected]

@liewegas