AGENDA
● What is Ceph?
● What is RADOS and what uses it?
● How does cache tiering fit in?
● How does erasure coding fit in?
● What's next?
CEPH MOTIVATING PRINCIPLES
● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
CEPH COMPONENTS
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
THE RADOS GATEWAY
[Diagram: APPLICATIONs speak REST to RADOSGW instances; each RADOSGW links LIBRADOS and talks over a socket to the RADOS CLUSTER and its monitors (M)]
RADOS
● Flat object namespace within each pool
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
LIBRADOS Interface
● Rich object API
– Bytes, attributes, key/value data
– Partial overwrite of existing data
– Single-object compound atomic operations
– RADOS classes (stored procedures)
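As a concrete sketch, this is how that API looks from the Python binding (a hypothetical cluster configured at /etc/ceph/ceph.conf and a pool named 'data'; object and key names are illustrative):

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')

    # Bytes: write a full object, then partially overwrite it in place.
    ioctx.write_full('greeting', b'hello world')
    ioctx.write('greeting', b'HELLO', offset=0)

    # Attributes: small named metadata attached to the object.
    ioctx.set_xattr('greeting', 'lang', b'en')

    # Key/value data via a compound op: everything queued on one
    # WriteOpCtx is applied atomically to the single object.
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, ('color',), (b'blue',))
        ioctx.operate_write_op(op, 'greeting')

    print(ioctx.read('greeting'))  # b'HELLO world'
    ioctx.close()
    cluster.shutdown()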
RADOS COMPONENTS
OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
RADOS COMPONENTS
Monitors (M):
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of the data path
CRUSH
[Diagram: OBJECTS in the CLUSTER are hashed into PLACEMENT GROUPS (PGs), which are in turn mapped to OSDs]
CRUSH AVOIDS FAILED DEVICES
[Diagram: CRUSH maps an OBJECT's PG only onto healthy OSDs, steering around failed devices in the RADOS CLUSTER]
CRUSH: DECLUSTERED PLACEMENT
RADOS CLUSTER
10
01
01
10
10
01 11
01
1001
0110 10 01
11
01
27● Each PG independently maps to a
pseudorandom set of OSDs
● PGs that map to the same OSD generally have replicas that do not
● When an OSD fails, each PG it stored will generally be re-replicated by a different OSD
– Highly parallel recovery
CRUSH: DYNAMIC DATA PLACEMENT
● CRUSH: pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
– Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
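CRUSH itself walks a weighted device hierarchy under placement rules, but a toy sketch (illustrative only, not the real algorithm) shows the key property: placement is a pure, repeatable function of the map, so any client computes the same answer with no central lookup.

    import hashlib

    def toy_crush(pool_id, pg_id, osds, num_replicas):
        # Toy stand-in for CRUSH: deterministically map a PG to OSDs.
        # Real CRUSH additionally respects the weighted hierarchy and
        # rule-based constraints (e.g., one replica per host).
        chosen, attempt = [], 0
        while len(chosen) < num_replicas:
            h = hashlib.sha256(f'{pool_id}.{pg_id}.{attempt}'.encode()).digest()
            osd = osds[int.from_bytes(h[:4], 'big') % len(osds)]
            if osd not in chosen:          # replicas on distinct OSDs
                chosen.append(osd)
            attempt += 1
        return chosen

    # Repeatable: the same PG always lands on the same OSDs.
    print(toy_crush(pool_id=1, pg_id=0x3f, osds=list(range(12)), num_replicas=3))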
DATA IS ORGANIZED INTO POOLS
[Diagram: OBJECTS hash into PGs inside POOLS A, B, C, and D; each pool contains its own set of PGs]
Peering
● Each OSDMap is numbered with an epoch number
● The Monitors and OSDs store a history of OSDMaps
● Using this history, an OSD which becomes a new member of a PG can deduce every OSD that could have received a write it needs to know about
● The process of discovering the authoritative state of the objects stored in the PG by contacting old PG members is called Peering
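A toy sketch of that deduction (the data shapes are hypothetical, not the real OSDMap structures): walk the stored map history and collect every acting set the PG has had since it was last known clean.

    def osds_to_contact(map_history, pg, last_clean_epoch):
        # Every OSD that appeared in the PG's acting set in any epoch
        # since the PG was last clean may hold writes the new member
        # must learn about during Peering.
        candidates = set()
        for epoch, acting_by_pg in sorted(map_history.items()):
            if epoch >= last_clean_epoch:
                candidates.update(acting_by_pg[pg])
        return candidates

    # Hypothetical history: epoch -> {pg: acting set of OSD ids}.
    history = {10: {'1.3f': [0, 1, 2]},
               11: {'1.3f': [0, 1, 5]},    # OSD 2 failed, OSD 5 joined
               12: {'1.3f': [4, 1, 5]}}    # OSD 0 replaced by OSD 4
    print(osds_to_contact(history, '1.3f', last_clean_epoch=10))
    # -> {0, 1, 2, 4, 5}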
Recovery, Backfill, and PG Temp
● The state of a PG on an OSD is represented by a PG Log which contains the most recent operations witnessed by that OSD.
● The authoritative state of a PG after Peering is represented by constructing an authoritative PG Log from an up-to-date peer.
● If a peer's PG Log overlaps the authoritative PG Log, we can construct a set of out-of-date objects to recover.
● Otherwise, we do not know which objects are out-of-date, and we must perform Backfill.
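A minimal sketch of that decision, with logs as (version, object) pairs, newest last (illustrative shapes, not the real PG Log records):

    def plan_recovery(auth_log, peer_log):
        # If the peer's log overlaps the authoritative one, the
        # divergent tail names exactly the out-of-date objects; with
        # no overlap we cannot enumerate them and must Backfill.
        peer_last = peer_log[-1][0] if peer_log else 0
        if peer_last < auth_log[0][0] - 1:          # gap: no overlap
            return 'backfill', None
        stale = {obj for ver, obj in auth_log if ver > peer_last}
        return 'recover', stale

    auth = [(5, 'objA'), (6, 'objB'), (7, 'objA')]
    print(plan_recovery(auth, [(4, 'objC'), (5, 'objA')]))  # recover {objA, objB}
    print(plan_recovery(auth, [(2, 'objC')]))               # backfill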
Recovery, Backfill, and PG Temp
● If we do not know which objects are invalid, we certainly cannot serve reads from such a peer
● If after Peering the primary determines that it or any other peer requires Backfill, it asks the Monitor cluster to publish a new map with an exception to the CRUSH mapping for this PG (a "PG temp" entry) that maps it to the best set of up-to-date peers it can find.
● Once that map is published, Peering happens again, and the up-to-date peers independently conclude that they should serve reads and writes while concurrently backfilling the out-of-date peers.
Backfill
● Backfill proceeds in object order (basically hash order) within the PG
● Each peer's progress is tracked by info.last_backfill:
– obj ≤ info.last_backfill → object is up to date
– obj > info.last_backfill → object is not yet up to date
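In sketch form (a plain integer hash position stands in for the real object ordering):

    def up_to_date_on_target(obj_pos, last_backfill):
        # Objects at or before the cursor have already been copied to
        # the backfill target; those after it are still stale there and
        # must be handled by up-to-date peers.
        return obj_pos <= last_backfill

    cursor = 0x8000          # hypothetical info.last_backfill position
    print(up_to_date_on_target(0x1234, cursor))   # True: already copied
    print(up_to_date_on_target(0xf00d, cursor))   # False: not yet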
TWO WAYS TO CACHE
● Within each OSD
– Combine SSD and HDD under each OSD
– Make localized promote/demote decisions
– Leverage existing tools
● dm-cache, bcache, FlashCache
● Variety of caching controllers
– We can help with hints
● Cache on separate devices/nodes
– Different hardware for different tiers
● Slow nodes for cold data
● High-performance nodes for hot data
– Add, remove, scale each tier independently
● Unlikely to choose right ratios at procurement time
[Diagram: an OSD whose file system (FS) sits on a block device (BLOCKDEV) combining an SSD cache with an HDD]
TIERED STORAGE
[Diagram: the APPLICATION talks to a replicated CACHE POOL layered over an erasure-coded BACKING POOL within the CEPH STORAGE CLUSTER]
RADOS TIERING PRINCIPLES
● Each tier is a RADOS pool
– May be replicated or erasure coded
● Tiers are durable
– e.g., replicate across SSDs in multiple hosts
● Each tier has its own CRUSH policy
– e.g., map cache pool to SSD devices/hosts only
● librados adapts to tiering topology
– Transparently direct requests accordingly
● e.g., to cache
– No changes to RBD, RGW, CephFS, etc.
WRITE INTO CACHE POOL
[Diagram: the CEPH CLIENT WRITEs to the writeback cache pool (SSD) and receives the ACK directly from it; the backing pool (HDD) is not touched]
WRITE (MISS)
[Diagram: on a write miss, the object is first PROMOTEd from the backing pool (HDD) into the cache pool, which then takes the WRITE and ACKs the client]
WRITE (MISS) (COMING SOON)
[Diagram: on a write miss, the cache pool instead PROXY WRITEs to the backing pool and ACKs the client, without promoting the object]
READ (CACHE HIT)
[Diagram: the CEPH CLIENT READs from the cache pool and gets the READ REPLY directly]
READ (CACHE MISS)
[Diagram: on a cache miss, the cache pool REDIRECTs the READ; the client re-reads from the backing pool, which sends the READ REPLY]
READ (CACHE MISS)
[Diagram: alternatively, the cache pool PROXY READs from the backing pool and relays the READ REPLY to the client]
READ (CACHE MISS)
[Diagram: the cache pool PROXY READs from the backing pool, replies to the client, and PROMOTEs the object into the cache in the background]
CACHE TIERING ARCHITECTURE
● Cache and backing pools are otherwise independent RADOS pools with independent placement rules.
● OSDs in the cache pool handle promotion and eviction of their objects independently, allowing for scalability.
● RADOS clients understand the tiering configuration and send requests to the right place, mostly without redirects.
● librados users can operate on a cache pool transparently, trusting the library to correctly route requests between the cache pool and base pool as needed.
ESTIMATING TEMPERATURE
● Each PG constructs in-memory bloom filters
– Insert records on both read and write
– Each filter covers configurable period (e.g., 1 hour)
– Tunable false positive probability (e.g., 5%)
– Maintain most recent N filters on disk
● Estimate temperature
– Has object been accessed in any of the last N periods?
– ...in how many of them?
– Informs flush/evict decision
● Estimate “recency”
– How many periods have passed since the object was last accessed?
– Informs read miss behavior: promote vs redirect
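A compact sketch of that bookkeeping, with a plain set standing in for each period's bloom filter (a real HitSet trades exactness for a tunable false-positive rate):

    from collections import deque

    class TemperatureEstimator:
        def __init__(self, num_periods=8):
            # Keep only the most recent N per-period filters.
            self.periods = deque([set()], maxlen=num_periods)

        def record_access(self, obj):      # called on reads and writes
            self.periods[-1].add(obj)

        def rotate(self):                  # once per period (e.g., hourly)
            self.periods.append(set())

        def temperature(self, obj):
            # In how many of the last N periods was the object accessed?
            # Informs the flush/evict decision.
            return sum(obj in p for p in self.periods)

        def recency(self, obj):
            # Periods since the last access (None if never seen);
            # informs promote-vs-redirect on a read miss.
            for age, p in enumerate(reversed(self.periods)):
                if obj in p:
                    return age
            return None

    est = TemperatureEstimator()
    est.record_access('objA')
    est.rotate()
    est.record_access('objA')
    print(est.temperature('objA'), est.recency('objA'))  # 2 0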
FLUSH AND/OR EVICT COLD DATA
[Diagram: the cache pool FLUSHes dirty objects to the backing pool (ACK), then EVICTs clean cold objects from the cache]
TIERING AGENT
● Each PG has an internal tiering agent
– Manages PG based on administrator defined policy
● Flush dirty objects
– When pool reaches target dirty ratio
– Tries to select cold objects
– Marks objects clean when they have been written back to the base pool
● Evict clean objects
– Greater “effort” as pool/PG size approaches target size
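A toy sketch of one agent pass under such a policy (thresholds, the batch size, and the effort curve are illustrative, not the actual tunables):

    def agent_pass(objects, dirty_ratio, target_dirty, full_ratio,
                   target_full, temperature):
        # objects: name -> {'dirty': bool}; temperature: name -> hits.
        # Returns (objects to flush, objects to evict) for one pass.
        coldest = sorted(objects, key=temperature)
        to_flush, to_evict = [], []
        if dirty_ratio > target_dirty:
            # Flush the coldest dirty objects first (batch of 10 is an
            # arbitrary illustration).
            to_flush = [o for o in coldest if objects[o]['dirty']][:10]
        if full_ratio > target_full:
            # Evict clean objects with "effort" growing as the pool
            # approaches its target size.
            effort = min(1.0, (full_ratio - target_full) / 0.2)
            n = int(effort * len(coldest))
            to_evict = [o for o in coldest[:n] if not objects[o]['dirty']]
        return to_flush, to_evict

    objs = {'a': {'dirty': True}, 'b': {'dirty': False}, 'c': {'dirty': True}}
    temp = {'a': 0, 'b': 3, 'c': 1}.get
    print(agent_pass(objs, 0.5, 0.4, 0.9, 0.8, temp))  # (['a', 'c'], [])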
CACHE TIER USAGE
● Cache tier should be faster than the base tier
● Cache tier should be replicated (not erasure coded)
● Promote and flush are expensive
– Best results when object temperatures are skewed
● Most I/O goes to a small number of hot objects
– Cache should be big enough to capture most of the working set
● Challenging to benchmark
– Need a realistic workload (e.g., not 'dd') to determine how it will perform in practice
– Takes a long time to “warm up” the cache
ERASURE CODING
[Diagram: a replicated pool stores full COPIES of an OBJECT on several OSDs; an erasure coded pool stores data shards 1-4 plus coding shards X and Y]
● Replicated pool: full copies of stored objects; very high durability; 3x storage (200% overhead); quicker recovery
● Erasure coded pool: one copy plus parity; cost-effective durability; e.g., 1.5x storage (50% overhead); expensive recovery
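The overhead figures follow directly from the shard counts; a one-line calculation makes the comparison explicit (k and m are the data/coding shard counts; k=4, m=2 matches the 1-4/X-Y layout in these diagrams, and 3x replication behaves like k=1, m=2):

    def storage_multiple(k, m):
        # Raw bytes stored per logical byte for k data + m coding shards.
        return (k + m) / k

    print(storage_multiple(1, 2))  # 3.0 -> 3x replication, 200% overhead
    print(storage_multiple(4, 2))  # 1.5 -> 4+2 erasure code, 50% overhead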
ERASURE CODING SHARDS
[Diagram: an OBJECT in an erasure coded pool is split into data shards 1-4 and coding shards X and Y, one shard per OSD]
ERASURE CODING SHARDS
[Diagram: stripe layout across six OSDs: data chunks 0-19 round-robin across shards 1-4 (e.g., chunks 0, 4, 8, 12, 16 on shard 1), with coding chunks A-E and A'-E' on shards X and Y]
● Variable stripe size (e.g., 4 KB)
● Zero-fill shards (logically) in partial tail stripe
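A sketch of that layout in code, with a single XOR parity standing in for the real coding math (jerasure/ISA-L compute m ≥ 1 proper coding shards; k and stripe_unit below are illustrative and mirror the diagram):

    def encode_stripe(data, k=4, stripe_unit=4096):
        # Split one stripe into k data shards plus an XOR parity shard.
        # The partial tail stripe is (logically) zero-filled.
        stripe = data.ljust(k * stripe_unit, b'\0')
        shards = [stripe[i * stripe_unit:(i + 1) * stripe_unit]
                  for i in range(k)]
        parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*shards))
        return shards, parity

    shards, parity = encode_stripe(b'hello erasure coding')
    # Losing shard 2 is survivable: rebuild it from the rest + parity.
    rebuilt = bytes(p ^ a ^ b ^ d for p, a, b, d in
                    zip(parity, shards[0], shards[1], shards[3]))
    assert rebuilt == shards[2]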
EC READ
[Diagram, three steps: the CEPH CLIENT sends a READ to the primary OSD of the erasure coded pool; the primary issues sub-READS to enough shards (1-4, X, Y) to reconstruct the object; the primary returns the READ REPLY]
EC WRITE
[Diagram, three steps: the CEPH CLIENT sends a WRITE to the primary OSD; the primary encodes the stripe and distributes sub-WRITES to all shards (1-4, X, Y); once they are durable, the primary ACKs the client]
EC WRITE: DEGRADED
[Diagram: a degraded write: some shard OSDs are down, but the primary still distributes sub-WRITES to the remaining shards]
EC WRITE: PARTIAL FAILURE
[Diagram, two steps: mid-write some sub-WRITES fail, leaving some shards with the new version (B) and others with the old version (A)]
EC RESTRICTIONS
● Overwrite in place will not work in general
● Log and 2PC would increase complexity, latency
● We chose to restrict allowed operations
– create
– append (on stripe boundary)
– remove (keep previous generation of object for some time)
● These operations can all easily be rolled back locally
– create → delete
– append → truncate
– remove → roll back to previous generation
● Object attrs preserved in existing PG logs (they are small)
● Key/value data is not allowed on EC pools
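A toy sketch of those local inverses (a dict stands in for one shard's object store; the op field names are hypothetical):

    def rollback(shard_store, op):
        # Each allowed operation has a cheap inverse an OSD can apply
        # on its own shard, with no cross-shard coordination.
        if op['kind'] == 'create':
            del shard_store[op['obj']]                     # create -> delete
        elif op['kind'] == 'append':
            data = shard_store[op['obj']]
            shard_store[op['obj']] = data[:op['old_len']]  # append -> truncate
        elif op['kind'] == 'remove':
            # remove kept the previous generation around for a while:
            shard_store[op['obj']] = op['old_generation']  # restore it
        else:
            raise ValueError('overwrites are not permitted on EC pools')

    store = {'o': b'AAAA'}
    rollback(store, {'kind': 'append', 'obj': 'o', 'old_len': 2})
    print(store)  # {'o': b'AA'}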
EC WRITE: PARTIAL FAILURE
[Diagram, two steps: after the failure, Peering finds the partially-applied write (B B B / A A A); the new shards are rolled back so every OSD again holds the old version (A A A A A A)]
EC RESTRICTIONS
● This is a small subset of allowed librados operations
– Notably cannot (over)write any extent
● Coincidentally, these operations are also inefficient for erasure codes
– Generally require read/modify/write of affected stripe(s)
● Some applications can consume EC directly
– RGW (no object data update in place)
● Others can combine EC with a cache tier (RBD, CephFS)
– Replication for warm/hot data
– Erasure coding for cold data
– Tiering agent skips objects with key/value data
WHICH ERASURE CODE?
● The EC algorithm and implementation are pluggable
– jerasure (free, open, and very fast)
– ISA-L (Intel library; optimized for modern Intel procs)
– LRC (local recovery code – layers over existing plugins)
● Parameterized
– Pick k and m, stripe size
● OSD handles data path, placement, rollback, etc.
● Plugin handles
– Encode and decode
– Given these available shards, which ones should I fetch to satisfy a read?
– Given these available shards and these missing shards, which ones should I fetch to recover?
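In sketch form, that split of responsibilities looks roughly like the interface below (method names are illustrative, not the actual C++ plugin API):

    from abc import ABC, abstractmethod

    class ECPlugin(ABC):
        # What a plugin owes the OSD; the OSD keeps the data path,
        # placement, and rollback for itself.

        @abstractmethod
        def encode(self, stripe):
            """Produce k data shards + m coding shards for one stripe."""

        @abstractmethod
        def decode(self, shards):
            """Rebuild the stripe from a sufficient subset of shards."""

        @abstractmethod
        def minimum_to_decode(self, want, available):
            """Given the available shards, which should be fetched to
            satisfy a read of the wanted shards?"""

        @abstractmethod
        def minimum_to_repair(self, missing, available):
            """Given available and missing shards, which must be
            fetched to recover? (LRC answers with a small local group.)"""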
LOCAL RECOVERY CODE (LRC)
[Diagram: in addition to shards 1-4, X, Y, the LRC pool stores local parity shards A, B, C on their own OSDs, so a single lost shard can be rebuilt from its small local group instead of k remote shards]
BIG THANKS TO
● Ceph
– Loic Dachary (CloudWatt, FSF France, Red Hat)
– Andreas Peters (CERN)
– David Zafman (Inktank / Red Hat)
● jerasure / gf-complete
– Jim Plank (University of Tennessee)
– Kevin Greenan (Box.com)
● Intel (ISA-L plugin)
● Fujitsu (SHEC plugin)
WHAT'S NEXT
● Erasure coding
– Allow (optimistic) client reads directly from shards
– ARM optimizations for jerasure
● Cache pools
– Better agent decisions (when to flush or evict)
– Supporting different performance profiles
● e.g., slow / “cheap” flash can read just as fast
– Complex topologies
● Multiple read-only cache tiers in multiple sites
● Tiering
– Support “redirects” to (very) cold tier below base pool
– Enable dynamic spin-down, dedup, and other features
OTHER ONGOING WORK
● Performance optimization (SanDisk, Intel, Mellanox)
● Alternative OSD backends
– New backend: hybrid key/value and file system
– leveldb, rocksdb, LMDB
● Messenger (network layer) improvements
– RDMA support (libxio – Mellanox)
– Event-driven TCP implementation (UnitedStack)
FOR MORE INFORMATION
● http://ceph.com
● http://github.com/ceph
● http://tracker.ceph.com
● Mailing lists
● irc.oftc.net
– #ceph
– #ceph-devel
● Twitter: @ceph