
Ceph Day LA: Building your own disaster? The safe way to make Ceph storage ready!


The safe way to make Ceph storage enterprise ready

Build your own [disaster]?

Paul von Stamwitz
Sr. Storage Architect
Storage Planning, R&D Center

2015-07-16

Copyright 2015 FUJITSU

1

The safe and convenient way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

2

Building Storage with Ceph looks simple

Ceph + some servers + network = storage

3

Building Storage with Ceph looks simple – but…

Many new complexities:
• Rightsizing server, disk types, network bandwidth
• Silos of management tools (HW, SW)
• Keeping Ceph versions in sync with versions of server HW, OS, connectivity, drivers
• Management of maintenance and support contracts of components
• Troubleshooting

Build Ceph open source storage yourself

4

The challenges of software defined storage

What users want:
• Open standards
• High scalability
• High reliability
• Lower costs
• No lock-in from a vendor

What users may get:
• A self-developed storage system based on open industry-standard HW & SW components
• High scalability and reliability – if the stack works
• Lower investments but higher operational efforts
• Lock-in into their own stack

5

ETERNUS CD10000 – Making Ceph enterprise ready

Build Ceph open source storage yourself → Out of the box: ETERNUS CD10000, incl. support and maintenance

ETERNUS CD10000 combines open source storage with enterprise-class quality of service:
• E2E Solution Contract by Fujitsu, based on Red Hat Ceph Enterprise
• Easy Deployment / Management by Fujitsu
• Lifecycle Management for Hardware & Software by Fujitsu

6

ETERNUS CD10000: A complete offer
• Fujitsu Maintenance, Support and Professional Services

7

Unlimited Scalability

• Cluster of storage nodes
• Capacity and performance scales by adding storage nodes
• Three different node types enable differentiated service levels:
  – Density/capacity optimized
  – Performance optimized
  – Optimized for small-scale dev & test
• 1st version of CD10000 (Q3/2014) is released for a range of 4 to 224 nodes
• Scales up to >50 Petabyte

Basic node: 12 TB | Performance node: 35 TB | Capacity node: 252 TB

8

Immortal System

[Diagram: Node1, Node2, … Node(n), plus added nodes with new generations of hardware]

• Non-disruptive add / remove / exchange of hardware (disks and nodes)
• Mix of nodes/disks of different generations, online technology refresh
• Very long lifecycle reduces migration efforts and costs

9

TCO optimized

• Based on x86 industry standard architectures
• Based on open source software (Ceph)
• High-availability and self-optimizing functions are part of the design at no extra costs
• Highly automated and fully integrated management reduces operational efforts
• Online maintenance and technology refresh reduce costs of downtime dramatically
• Extremely long lifecycle delivers investment protection
• End-to-end design and maintenance from Fujitsu reduces evaluation, integration, and maintenance costs

Better service levels at reduced costs – business centric storage

10

One storage – seamless management

ETERNUS CD10000 delivers one seamless management for the complete stack:
• Central Ceph software deployment
• Central storage node management
• Central network management
• Central log file management
• Central cluster management
• Central configuration, administration and maintenance
• SNMP integration of all nodes and network components

11

Seamless management (2)

• Dashboard – overview of cluster status
• Server Management – management of cluster hardware – add/remove server (storage node), replace storage devices
• Cluster Management – management of cluster resources – cluster and pool creation
• Monitoring the cluster – monitoring overall capacity, pool utilization, status of OSD, Monitor and MDS processes, Placement Group status, and RBD status
• Managing OpenStack interoperation – connection to OpenStack server and placement of pools in Cinder multi-backend
• Optional use of Calamari Management GUI

12

13

Example: Replacing an HDD

Plain Ceph:
• take the failed disk offline in Ceph
• take the failed disk offline on OS / controller level
• identify the (right) hard drive in the server
• exchange the hard drive
• partition the hard drive on OS level
• make and mount the file system
• bring the disk up in Ceph again

On ETERNUS CD10000:
• vsm_cli <cluster> replace-disk-out <node> <dev>
• exchange the hard drive
• vsm_cli <cluster> replace-disk-in <node> <dev>

14

Example: Adding a Node

Plain Ceph:
• Install hardware
• Install OS
• Configure OS
• Partition disks (OSDs, journals)
• Make filesystems
• Configure network
• Configure ssh
• Configure Ceph
• Add node to cluster

On ETERNUS CD10000:
• Install hardware
  – hardware will automatically PXE boot and install the current cluster environment, including the current configuration
• Node automatically available to GUI
• Add node to cluster with a mouse click on the GUI
  – automatic PG adjustment if needed
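The "automatic PG adjustment" refers to placement-group sizing. As an illustration only (not CD10000 code), a common community rule of thumb for picking a pool's PG count can be sketched in Python; the function names and the target of ~100 PGs per OSD are assumptions, not taken from the talk:

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

def suggested_pg_count(num_osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count for a pool:
    (OSDs * target PGs per OSD) / replica count, rounded up to a power of two."""
    return next_power_of_two(num_osds * pgs_per_osd // replicas)
```

For example, a 3-node cluster with 4 OSDs each and 3 replicas would get 12 * 100 / 3 = 400, rounded up to 512 PGs.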

15

Adding and Integrating Apps

• The ETERNUS CD10000 architecture enables the integration of apps
• Fujitsu is working with customers and software vendors to integrate selected storage apps
• E.g. archiving, sync & share, data discovery, cloud apps…

[Architecture diagram: cloud services, sync & share, archive, and iRODS data discovery apps on top of ETERNUS CD10000 – object level access, block level access, file level access; central management; Ceph storage system SW and Fujitsu extensions; 10GbE frontend network; fast interconnect network; performance nodes and capacity nodes]

16

ETERNUS CD10000 at University Mainz

• Large university in Germany
• Uses the iRODS application for library services
• iRODS is open-source data management software in use at research organizations and government agencies worldwide
• Organizes and manages large depots of distributed digital data
• Customer has built an interface from iRODS to Ceph
• Stores raw data of measurement instruments (e.g. research in chemistry and physics) for 10+ years, meeting compliance rules of the EU
• Need to provide extensive and rapidly growing data volumes online at reasonable costs
• Will implement a sync & share service on top of ETERNUS CD10000

17

Summary: ETERNUS CD10k – Key Values

[Diagram labels around ETERNUS CD10000: Unlimited Scalability – TCO optimized – The new unified – Immortal System – Zero Downtime]

ETERNUS CD10000 combines open source storage with enterprise-class quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack?

Free open source (Apache license) software governed by a non-profit foundation (corporation) with a mission to produce the ubiquitous Open Source Cloud Computing platform that will meet the needs of public and private clouds regardless of size, by being simple to implement and massively scalable.

[Foundation member tiers: Platinum, Gold, Corporate, …]

• Massively scalable cloud operating system that controls large pools of compute, storage, and networking resources
• Community OSS with contributions from 1000+ developers and 180+ participating organizations
• Open web-based API; programmatic IaaS
• Plug-in architecture allows different hypervisors, block storage systems, network implementations; hardware agnostic, etc.

http://www.openstack.org/foundation/companies/

20

Attained fast growing customer interest

• VMware clouds dominate
• OpenStack clouds already #2
• Worldwide adoption

Source: OpenStack User Survey and Feedback, Nov 3rd 2014
Source: OpenStack User Survey and Feedback, May 13th 2014

21

Why are customers so interested?

• Greatest industry & community support compared to alternative open platforms (Eucalyptus, CloudStack, OpenNebula)
• "Ability to Innovate" jumped from #6 to #1

Source: OpenStack User Survey and Feedback, Nov 3rd 2014

22

OpenStack.org User Survey, Paris, Nov 2014

23

OpenStack Cloud Layers: OpenStack and ETERNUS CD10000

[Layer diagram, bottom to top:]
• Physical server (CPU, memory, SSD, HDD) and network
• Base operating system (CentOS), with OAM (dhcp, deploy, LCM)
• Hypervisor: KVM, ESXi, Hyper-V
• OpenStack cloud APIs: Compute (Nova), Network (Neutron) + plugins, Dashboard (Horizon), Billing Portal, Authentication (Keystone), Images (Glance), EC2 API, Metering (Ceilometer), Volume (Cinder), Object (Swift), Manila (File)
• Fujitsu Open Cloud Storage: RADOS – Block (RBD), S3 (Rados-GW), File (CephFS)

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data:
• Content data for ICT services is ever-growing
• Demand for higher space efficiency and durability
• Reed-Solomon code (the de facto erasure code) improves both

[Diagram: Triple replication (old style): content data + copy + copy = 3x space. Reed-Solomon code: content data + parities = 1.5x space]

However, Reed-Solomon code is not so recovery-efficient.
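The space factors above follow directly from the chunk counts. A small sketch, illustrative only (the helper names are made up):

```python
def replication_space(copies: int) -> float:
    """Raw-to-usable space factor for n-way replication:
    every byte is stored `copies` times."""
    return float(copies)

def erasure_space(k: int, m: int) -> float:
    """Raw-to-usable space factor for an erasure code with
    k data chunks and m parity chunks: (k + m) / k."""
    return (k + m) / k

# Triple replication stores every byte three times; a Reed-Solomon
# layout such as k=10, m=5 needs only half as much raw capacity (1.5x).
```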

26

Backgrounds (2)

Local parity improves recovery efficiency:
• Data recovery should be as efficient as possible, in order to avoid multiple disk failures and data loss
• Reed-Solomon code was improved by local parity methods: the data read from disks is reduced during recovery

[Diagram: data chunks and parity chunks – Reed-Solomon code (no local parities) vs. a local parity method; local parities reduce the data read from disks]

However, multiple disk failures are out of consideration.

27

Local parity method for multiple disk failures

• Existing methods are optimized for single disk failures, e.g. Microsoft MS-LRC, Facebook Xorbas
• However, their recovery overhead is large in case of multiple disk failures, because they have a chance to use global parities for recovery
• Our goal is a method that efficiently handles multiple disk failures

28

Our Proposal Method: SHEC (= Shingled Erasure Code)

• An erasure code only with local parity groups, to improve recovery efficiency in case of multiple disk failures
• The calculation ranges of local parities are shifted and partly overlap with each other (like the shingles on a roof), to keep enough durability

[Example layout: k data chunks (=10), m parity chunks (=6), l calculation range (=5)]
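To make the shingled layout concrete, here is an illustrative Python sketch of which data chunks each local parity covers. This is a simplification of the idea for the k=10, m=6, l=5 example, not Ceph's actual SHEC plugin code:

```python
def shec_coverage(k: int, m: int, l: int) -> list[range]:
    """For m local parities over k data chunks, each parity covers l
    consecutive chunks; start offsets are spread evenly so neighbouring
    parity ranges overlap like roof shingles."""
    step = (k - l) / (m - 1)  # shift between consecutive parity ranges
    return [range(round(i * step), round(i * step) + l) for i in range(m)]

layout = shec_coverage(k=10, m=6, l=5)
# parity 0 covers chunks 0..4, parity 5 covers chunks 5..9,
# and each parity overlaps its neighbour in 4 chunks
```

Every data chunk is covered by at least one local parity, so a single failure never needs a global parity, and overlapping ranges give extra options when several disks fail at once.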

29

Summary: mSHEC

1. mSHEC is more adjustable than Reed-Solomon code, because SHEC provides many recovery-efficient layouts, including Reed-Solomon codes
2. mSHEC's recovery time was ~20% faster than Reed-Solomon code in case of double disk failures
3. The mSHEC erasure code is included in the Hammer release
4. For more information, see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

• Ceph has adequate performance today
• But there are performance issues which prevent us from taking full advantage of our hardware resources

Two main goals for improvement:
(1) Decrease latency in the Ceph code path
(2) Enhance large cluster scalability with many nodes / OSDs

32

LTTng general, http://lttng.org

General:
• open source tracing framework for Linux
• traces the Linux kernel and user space applications
• low overhead and therefore usable on production systems
• tracing can be activated at runtime
• Ceph code contains LTTng trace points already

Our LTTng based profiling:
• within a function, collect timestamp information at the interesting places
• save the collected information in a single trace point at the end of the function
• transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points
• focused on primary and secondary write operations
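The transaction-id correlation can be illustrated with a toy sketch. The event data and trace-point names below are hypothetical; this is not the actual LTTng tooling, only the grouping idea:

```python
from collections import defaultdict

# Hypothetical trace stream: (transaction id, trace point, timestamp in ms)
events = [
    (7, "op_queued",   0.0),
    (8, "op_queued",   0.5),
    (7, "journaled",   1.25),
    (8, "journaled",   1.5),
    (7, "commit_sent", 2.0),
    (8, "commit_sent", 4.5),
]

def per_transaction_latency(trace):
    """Group trace points by transaction id and report first-to-last
    latency, so one write op is profiled end to end across functions
    and threads instead of per function."""
    times = defaultdict(list)
    for tx, _point, t in trace:
        times[tx].append(t)
    return {tx: max(ts) - min(ts) for tx, ts in times.items()}
```

Here transaction 8 takes 4.0 ms against 2.0 ms for transaction 7: exactly the kind of variance between samples that the evaluation below discusses.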

33

Turn Around Time of a single Write IO

34

LTTng data evaluation: Replication Write

Observation:
• replication write latency suffers from a large variance problem
• minimum and average differ by a factor of 2
• this is a common problem, visible for many ceph-osd components

Why is variance so large?
• Observation: no single hotspot visible
• Observation: active processing steps do not differ between the minimum and average sample as much as the total latency does
• Additional latency penalty mostly at the switch from sub_op_modify_commit to Pipewriter
• no indication that queue length is the cause
• Question: can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the Pipewriter thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3 nodes with 4 OSDs each → ~700 threads per node; 9 nodes with 40 OSDs each → >100k threads per node

ThreadPool::WorkThread is a hot spot (= work in the ObjectStore / FileStore)

Total CPU usage during test: 4317 CPU seconds

Thread class                          | CPU seconds | Share
PipeWriter                            | 459         | 10.63%
PipeReader                            | 581         | 13.45%
ShardedThreadPool::WorkThreadSharded  | 808         | 18.70%
ThreadPool::WorkThread                | 1556        | 36.04%
FileJournal::Writer                   | 241         | 5.57%
FileJournal::WriteFinisher            | 101         | 2.33%
Finisher::finisher_thread_entry       | 286         | 6.63%

36

FileStore benchmarking

• most of the work is done in FileStore::do_transactions
• each write transaction consists of:
  – 3 calls to omap_setkeys
  – the actual call to write to the file system
  – 2 calls to setattr
• Proposal: coalesce the calls to omap_setkeys – 1 function call instead of 3, setting 5 key/value pairs instead of 6 (duplicate key)
• The official change was to coalesce at the higher PG layer
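The coalescing idea can be sketched as follows. The key names are made up for illustration (the real change lives in Ceph's C++ transaction path); the point is that merging drops the duplicated key:

```python
def coalesce_setkeys(calls):
    """Merge several omap_setkeys-style calls into a single call.
    Later values win for duplicated keys, so the merged call carries
    fewer key/value pairs than the sum of the originals."""
    merged = {}
    for kv in calls:
        merged.update(kv)
    return merged

# Three calls with six pairs in total, one key ("_info") duplicated:
calls = [
    {"_epoch": "e12", "_info": "v1"},
    {"_info": "v2", "snap_seq": "s3"},
    {"pg_log": "entry", "pg_biginfo": "b"},
]
merged = coalesce_setkeys(calls)  # one call, five key/value pairs
```

One call into the backend instead of three means fewer journal entries and less per-call overhead for every write transaction.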

37

Other areas of investigation and improvement

• Lock analysis
  – RWLock instead of mutex
  – start with CRC locks
• Bufferlist tuning
  – optimize for jumbo packets
  – malloc issues

38

Summary and Conclusion

• ETERNUS CD10k is the safe way to make Ceph enterprise ready
• Unlimited scalability: 4 to 224 nodes, scales up to >50 Petabyte
• Immortal system with zero downtime: non-disruptive add / remove / exchange of hardware (disks and nodes) or software updates
• TCO optimized: highly automated and fully integrated management reduces operational efforts
• Tight integration in OpenStack with its own GUI
• Fujitsu will continue to enhance ease-of-use and performance
• This is important: as Ceph's popularity increases, competitors will attack Ceph in these areas

39

pvonstamwitz@us.fujitsu.com

1

The safe and convenient way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

2

Building Storage with Ceph looks simple

Copyright 2015 FUJITSU

Ceph

+ some servers

+ network

= storage

3

Building Storage with Ceph looks simple ndash buthelliphellip

Many new Complexities

Rightsizing server disk types network

bandwidth

Silos of management tools (HW SW)

Keeping Ceph versions with versions of

server HW OS connectivity drivers in sync

Management of maintenance and support

contracts of components

Troubleshooting

Copyright 2015 FUJITSU

Build Ceph source storage yourself

4

The challenges of software defined storage

What users want

Open standards

High scalability

High reliability

Lower costs

No-lock in from a vendor

What users may get

An own developed storage system based on open

industry standard HW amp SW components

High scalability and reliability If the stack works

Lower investments but higher operational efforts

Lock-in into the own stack

Copyright 2015 FUJITSU

5

ETERNUS CD10000 ndash Making Ceph enterprise ready

Build Ceph source storage yourself Out of the box ETERNUS CD10000

incl support

incl maintenance

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

E2E Solution Contract by Fujitsu based on Red Hat Ceph Enterprise

Easy Deployment Management by Fujitsu

+

+

+ Lifecycle Management for Hardware amp Software by Fujitsu

+

6

Fujitsu Maintenance Support and Professional Services

ETERNUS CD10000 A complete offer

Copyright 2015 FUJITSU

7

Unlimited Scalability

Cluster of storage nodes

Capacity and performance scales by

adding storage nodes

Three different node types enable

differentiated service levels

Density capacity optimized

Performance optimized

Optimized for small scale dev amp test

1st version of CD10000 (Q32014) is

released for a range of 4 to 224 nodes

Scales up to gt50 Petabyte

Copyright 2015 FUJITSU

Basic node 12 TB Performance node 35 TB Capacity node 252 TB

8

Immortal System

Copyright 2015 FUJITSU

Node1 Node2 Node(n)

+

Adding nodes

with new generation

of hardware

helliphelliphellip +

Adding nodes

New Node1

Non-disruptive add remove exchange of hardware (disks and nodes)

Mix of nodesdisks of different generations online technology refresh

Very long lifecycle reduces migration efforts and costs

9

TCO optimized

Based on x86 industry standard architectures

Based on open source software (Ceph)

High-availability and self-optimizing functions are part

of the design at no extra costs

Highly automated and fully integrated management

reduces operational efforts

Online maintenance and technology refresh reduce

costs of downtime dramatically

Extreme long lifecycle delivers investment protection

End-to-end design an maintenance from Fujitsu

reduces evaluation integration maintenance costs

Copyright 2015 FUJITSU

Better service levels at reduced costs ndash business centric storage

10

One storage ndash seamless management

ETERNUS CD10000 delivers one seamless

management for the complete stack

Central Ceph software deployment

Central storage node management

Central network management

Central log file management

Central cluster management

Central configuration administration and

maintenance

SNMP integration of all nodes and

network components

Copyright 2015 FUJITSU

11

Seamless management (2)

Dashboard ndash Overview of cluster status

Server Management ndash Management of cluster hardware ndash addremove server

(storage node) replace storage devices

Cluster Management ndash Management of cluster resources ndash cluster and pool creation

Monitoring the cluster ndash Monitoring overall capacity pool utilization status of OSD

Monitor and MDS processes Placement Group status and RBD status

Managing OpenStack Interoperation Connection to OpenStack Server and

placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI

12

13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS

Controller Level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli ltclustergt replace-disk-out

ltnodegt ltdevgt

exchange hard drive

vsm_cli ltclustergt replace-disk-in

ltnodegt ltdevgt

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

2

Building Storage with Ceph looks simple

Copyright 2015 FUJITSU

Ceph

+ some servers

+ network

= storage

3

Building Storage with Ceph looks simple ndash buthelliphellip

Many new Complexities

Rightsizing server disk types network

bandwidth

Silos of management tools (HW SW)

Keeping Ceph versions with versions of

server HW OS connectivity drivers in sync

Management of maintenance and support

contracts of components

Troubleshooting

Copyright 2015 FUJITSU

Build Ceph source storage yourself

4

The challenges of software defined storage

What users want

Open standards

High scalability

High reliability

Lower costs

No-lock in from a vendor

What users may get

An own developed storage system based on open

industry standard HW amp SW components

High scalability and reliability If the stack works

Lower investments but higher operational efforts

Lock-in into the own stack

Copyright 2015 FUJITSU

5

ETERNUS CD10000 ndash Making Ceph enterprise ready

Build Ceph source storage yourself Out of the box ETERNUS CD10000

incl support

incl maintenance

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

E2E Solution Contract by Fujitsu based on Red Hat Ceph Enterprise

Easy Deployment Management by Fujitsu

+

+

+ Lifecycle Management for Hardware amp Software by Fujitsu

+

6

Fujitsu Maintenance Support and Professional Services

ETERNUS CD10000 A complete offer

Copyright 2015 FUJITSU

7

Unlimited Scalability

Cluster of storage nodes

Capacity and performance scales by

adding storage nodes

Three different node types enable

differentiated service levels

Density capacity optimized

Performance optimized

Optimized for small scale dev amp test

1st version of CD10000 (Q32014) is

released for a range of 4 to 224 nodes

Scales up to gt50 Petabyte

Copyright 2015 FUJITSU

Basic node 12 TB Performance node 35 TB Capacity node 252 TB

8

Immortal System

Copyright 2015 FUJITSU

Node1 Node2 Node(n)

+

Adding nodes

with new generation

of hardware

helliphelliphellip +

Adding nodes

New Node1

Non-disruptive add remove exchange of hardware (disks and nodes)

Mix of nodesdisks of different generations online technology refresh

Very long lifecycle reduces migration efforts and costs

9

TCO optimized

Based on x86 industry standard architectures

Based on open source software (Ceph)

High-availability and self-optimizing functions are part

of the design at no extra costs

Highly automated and fully integrated management

reduces operational efforts

Online maintenance and technology refresh reduce

costs of downtime dramatically

Extreme long lifecycle delivers investment protection

End-to-end design an maintenance from Fujitsu

reduces evaluation integration maintenance costs

Copyright 2015 FUJITSU

Better service levels at reduced costs – business centric storage

10

One storage – seamless management

ETERNUS CD10000 delivers one seamless

management for the complete stack

Central Ceph software deployment

Central storage node management

Central network management

Central log file management

Central cluster management

Central configuration administration and

maintenance

SNMP integration of all nodes and

network components

Copyright 2015 FUJITSU

11

Seamless management (2)

Dashboard – overview of cluster status

Server Management – management of cluster hardware: add/remove server (storage node), replace storage devices

Cluster Management – management of cluster resources: cluster and pool creation

Monitoring the cluster – monitoring overall capacity, pool utilization, status of OSD, Monitor and MDS processes, Placement Group status and RBD status

Managing OpenStack interoperation – connection to OpenStack server and placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI


13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS / controller level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli <cluster> replace-disk-out <node> <dev>

exchange hard drive

vsm_cli <cluster> replace-disk-in <node> <dev>

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

• hardware will automatically PXE boot and install the current cluster environment, including the current configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

• Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

E.g. archiving, sync & share, data discovery, cloud apps…

Copyright 2015 FUJITSU

[Architecture diagram] Apps: Cloud Services, Sync & Share, Archive, iRODS data discovery

ETERNUS CD10000: Object Level Access, Block Level Access, File Level Access

Central Management

Ceph Storage System SW and Fujitsu Extensions

10GbE Frontend Network, Fast Interconnect Network

Performance Nodes, Capacity Nodes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is open-source data management software in use at research organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (e.g. research in chemistry and physics) for 10+ years, meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000 – the new unified: Unlimited Scalability, TCO optimized, Immortal System, Zero Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack?

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Sponsor tiers: Platinum, Gold, Corporate, …

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API; programmatic IaaS

Plug-in architecture allows different hypervisors, block storage systems, network implementations; hardware agnostic etc.

http://www.openstack.org/foundation/companies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already #2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested?

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry & community support compared to alternative open platforms (Eucalyptus, CloudStack, OpenNebula)

"Ability to Innovate" jumped from #6 to #1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU, Memory, SSD, HDD) and Network

Base Operating System (CentOS)

OAM: dhcp, deploy, LCM

Hypervisor: KVM, ESXi, Hyper-V

OpenStack: Compute (Nova), Network (Neutron) + plugins, Dashboard (Horizon), Billing Portal, Cloud APIs, Object (Swift), Volume (Cinder), Authentication (Keystone), Images (Glance), EC2 API, Metering (Ceilometer), Manila (File)

Ceph / RADOS: Block (RBD), S3 (RADOS-GW), File (CephFS)

Fujitsu Open Cloud Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

(Old style) Triple replication: content data + copy + copy → 3x space

Reed Solomon code: content data + parity + parity → 1.5x space

However, Reed Solomon code is not so recovery-efficient
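The space figures above can be checked with a small sketch. The (k=10, m=5) Reed Solomon layout below is an assumed example for illustration; the slide only states the 1.5x result:

```python
def replication_overhead(copies):
    # n full copies of the data -> n x raw space per stored byte
    return float(copies)

def reed_solomon_overhead(k, m):
    # k data chunks plus m parity chunks -> (k + m) / k x raw space
    return (k + m) / k

print(replication_overhead(3))       # triple replication -> 3.0
print(reed_solomon_overhead(10, 5))  # assumed RS(k=10, m=5) -> 1.5
```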

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

• in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

• data read from disks is reduced during recovery

[Figure: Reed Solomon code (no local parities) vs. a local parity method – data read from disks during recovery]

However, multiple disk failures are not considered

27

Local parity method for multiple disk failures

Existing methods are optimized for single disk failures

• e.g. Microsoft MS-LRC, Facebook Xorbas

However, their recovery overhead is large in case of multiple disk failures

• because they have a chance to use global parities for recovery

Our Goal

Our goal is a method efficiently handling multiple disk failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

• to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each other (like the shingles on a roof)

• to keep enough durability

Our Proposal Method (SHEC)

k = data chunks (=10), m = parity chunks (=6), l = calculation range (=5)
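The shingled layout can be sketched as shifted, overlapping parity ranges over the data chunks. The evenly spaced start offsets with wraparound used here are an illustration of the idea, not necessarily the exact SHEC(k=10, m=6, l=5) layout implemented in Ceph:

```python
def shingled_layout(k=10, m=6, l=5):
    """Assign each of m local parities a range of l data chunks.
    Ranges are shifted so that neighbours partly overlap, like
    the shingles on a roof."""
    layout = []
    for i in range(m):
        start = (i * k) // m  # evenly spaced start offsets (assumption)
        layout.append([(start + j) % k for j in range(l)])
    return layout

layout = shingled_layout()
for i, chunks in enumerate(layout):
    print(f"parity {i}: data chunks {chunks}")

# every data chunk is covered by at least one local parity
covered = {c for chunks in layout for c in chunks}
assert covered == set(range(10))
```

Because every data chunk sits in a small local group, a failed chunk can be rebuilt from l reads instead of k, and the overlap keeps durability when several disks fail at once.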

29

1. mSHEC is more adjustable than Reed Solomon code, because SHEC provides many recovery-efficient layouts, including Reed Solomon codes

2. mSHEC's recovery time was ~20% faster than Reed Solomon code in case of double disk failures

3. mSHEC erasure code is included in the Hammer release

4. For more information see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes / OSDs

32

LTTng general (http://lttng.org)

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations
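The transaction-oriented profiling described above can be sketched as follows. The event tuples and trace-point names are hypothetical stand-ins for the real LTTng output; the point is correlating timestamps by Ceph transaction id rather than by function:

```python
from collections import defaultdict

# hypothetical trace events: (transaction_id, trace_point, timestamp_us)
events = [
    (42, "op_received",       100),
    (42, "journal_committed", 480),
    (42, "reply_sent",        910),
    (43, "op_received",       120),
    (43, "journal_committed", 390),
    (43, "reply_sent",        700),
]

# correlate trace points by transaction id instead of profiling functions
by_txn = defaultdict(dict)
for txn, point, ts in events:
    by_txn[txn][point] = ts

for txn, points in sorted(by_txn.items()):
    total = points["reply_sent"] - points["op_received"]
    print(f"txn {txn}: turnaround {total} us")  # txn 42: turnaround 810 us
```

Grouping by transaction id is what lets one write IO be followed across threads and OSDs, which per-function profiling cannot do.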

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large?

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from sub_op_modify_commit to PipeWriter

no indication that queue length is the cause

Question: Can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the PipeWriter thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3 nodes with 4 OSDs each → ~700 threads per node; 9 nodes with 40 OSDs each → >100k threads per node

ThreadPool::WorkThread is a hot spot (= work in the ObjectStore / FileStore)

total CPU usage during test: 4317 CPU seconds

Thread class                           CPU seconds   % of total
PipeWriter                             459           10.63
PipeReader                             581           13.45
ShardedThreadPool::WorkThreadSharded   808           18.70
ThreadPool::WorkThread                 1556          36.04
FileJournal::Writer                    241           5.57
FileJournal::WriteFinisher             101           2.33
Finisher::finisher_thread_entry        286           6.63
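As a cross-check, the per-thread shares can be recomputed from the CPU seconds and the 4317-second total; they agree with the reported percentages to within rounding:

```python
total = 4317  # total CPU seconds during the test

# CPU seconds per thread class, from the slide
threads = {
    "PipeWriter": 459,
    "PipeReader": 581,
    "ShardedThreadPool::WorkThreadSharded": 808,
    "ThreadPool::WorkThread": 1556,
    "FileJournal::Writer": 241,
    "FileJournal::WriteFinisher": 101,
    "Finisher::finisher_thread_entry": 286,
}

for name, secs in threads.items():
    print(f"{name}: {secs} s = {100 * secs / total:.2f} %")
```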

36

FileStore benchmarking

most of the work is done in FileStore::do_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3; set 5 key/value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer
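The coalescing idea can be illustrated with a toy sketch. The key/value payloads below are hypothetical; the real transactions carry encoded PG metadata:

```python
def coalesce_omap_setkeys(calls):
    """Merge several omap_setkeys payloads into one call.
    Later calls win on duplicate keys, matching the effect of
    applying the calls sequentially."""
    merged = {}
    for kv in calls:
        merged.update(kv)
    return merged

# hypothetical example: 3 calls carrying 6 key/value pairs in total,
# with one key ("_info") duplicated across calls
calls = [
    {"_epoch": "e100", "_info": "v1"},
    {"_log.0001": "entry-a", "_log.0002": "entry-b"},
    {"_info": "v2", "_biginfo": "snap-meta"},
]

merged = coalesce_omap_setkeys(calls)
print(len(merged))  # 5 keys instead of 6 (duplicate "_info" collapsed)
```

One merged call saves two key/value store round trips per write transaction, which is where the latency win comes from.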

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability: 4 to 224 nodes, scales up to >50 Petabyte

Immortal System with Zero Downtime: non-disruptive add, remove, exchange of hardware (disks and nodes) or software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Ceph's popularity increases, competitors will attack Ceph in these areas

39 Copyright 2015 FUJITSU

pvonstamwitz@us.fujitsu.com


bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

5

ETERNUS CD10000 ndash Making Ceph enterprise ready

Build Ceph source storage yourself Out of the box ETERNUS CD10000

incl support

incl maintenance

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

E2E Solution Contract by Fujitsu based on Red Hat Ceph Enterprise

Easy Deployment Management by Fujitsu

+

+

+ Lifecycle Management for Hardware amp Software by Fujitsu

+

6

Fujitsu Maintenance, Support and Professional Services

ETERNUS CD10000: A complete offer

7

Unlimited Scalability

• Cluster of storage nodes
• Capacity and performance scale by adding storage nodes
• Three different node types enable differentiated service levels:
  – density/capacity optimized
  – performance optimized
  – optimized for small-scale dev & test
• 1st version of CD10000 (Q3/2014) is released for a range of 4 to 224 nodes
• Scales up to >50 petabytes

Basic node: 12 TB | Performance node: 35 TB | Capacity node: 252 TB

8

Immortal System

[Diagram: Node1, Node2, ... Node(n); nodes of a new hardware generation are added to the running cluster over time]

• Non-disruptive add/remove/exchange of hardware (disks and nodes)
• Mix of nodes/disks of different generations, online technology refresh
• Very long lifecycle reduces migration efforts and costs

9

TCO optimized

Based on x86 industry standard architectures

Based on open source software (Ceph)

High-availability and self-optimizing functions are part

of the design at no extra costs

Highly automated and fully integrated management

reduces operational efforts

Online maintenance and technology refresh reduce

costs of downtime dramatically

Extreme long lifecycle delivers investment protection

End-to-end design an maintenance from Fujitsu

reduces evaluation integration maintenance costs

Copyright 2015 FUJITSU

Better service levels at reduced costs ndash business centric storage

10

One storage ndash seamless management

ETERNUS CD10000 delivers one seamless management for the complete stack:

• Central Ceph software deployment
• Central storage node management
• Central network management
• Central log file management
• Central cluster management
• Central configuration, administration and maintenance
• SNMP integration of all nodes and network components

11

Seamless management (2)

• Dashboard – overview of cluster status
• Server Management – management of cluster hardware: add/remove server (storage node), replace storage devices
• Cluster Management – management of cluster resources: cluster and pool creation
• Monitoring the cluster – monitoring overall capacity, pool utilization, status of OSD, Monitor and MDS processes, Placement Group status and RBD status
• Managing OpenStack interoperation – connection to the OpenStack server and placement of pools in the Cinder multi-backend
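In a typical OpenStack setup, the Cinder multi-backend pool placement mentioned above is configured roughly as follows. This is only a hedged sketch of a standard Ceph RBD multi-backend cinder.conf; the backend names, pool names and the secret UUID are illustrative, not taken from the CD10000 product:

```ini
# cinder.conf - illustrative multi-backend setup with two Ceph RBD pools
[DEFAULT]
enabled_backends = rbd-performance,rbd-capacity

[rbd-performance]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = rbd-performance
rbd_pool = volumes_perf              ; hypothetical pool on performance nodes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337  ; libvirt secret (example value)

[rbd-capacity]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = rbd-capacity
rbd_pool = volumes_cap               ; hypothetical pool on capacity nodes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
```

Cinder volume types mapped to `volume_backend_name` then steer new volumes to the desired pool, e.g. one type per service level.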

12

Optional use of Calamari Management GUI


13

Example Replacing an HDD

Plain Ceph:
• take the failed disk offline in Ceph
• take the failed disk offline on OS / controller level
• identify the (right) hard drive in the server
• exchange the hard drive
• partition the hard drive on OS level
• make and mount the file system
• bring the disk up in Ceph again

On ETERNUS CD10000:
• vsm_cli <cluster> replace-disk-out <node> <dev>
• exchange the hard drive
• vsm_cli <cluster> replace-disk-in <node> <dev>

14

Example Adding a Node

Plain Ceph:
• install hardware
• install OS
• configure OS
• partition disks (OSDs, journals)
• make filesystems
• configure network
• configure ssh
• configure Ceph
• add node to cluster

On ETERNUS CD10000:
• install hardware
  – hardware will automatically PXE boot and install the current cluster environment, including the current configuration
• node automatically available to the GUI
• add node to cluster with a mouse click on the GUI
  – automatic PG adjustment if needed

15

Adding and Integrating Apps

• The ETERNUS CD10000 architecture enables the integration of apps
• Fujitsu is working with customers and software vendors to integrate selected storage apps
• E.g. archiving, sync & share, data discovery, cloud apps...

[Architecture diagram: apps such as cloud services, sync & share, archive and iRODS data discovery sit on top of ETERNUS CD10000, which offers object-, block- and file-level access under central management; inside, Ceph storage system software plus Fujitsu extensions run on performance and capacity nodes, connected by a 10GbE frontend network and a fast interconnect network]

16

ETERNUS CD10000 at University Mainz

• Large university in Germany
• Uses the iRODS application for library services
• iRODS is an open-source data management software in use at research organizations and government agencies worldwide
• Organizes and manages large depots of distributed digital data
• Customer has built an interface from iRODS to Ceph
• Stores raw data of measurement instruments (e.g. research in chemistry and physics) for 10+ years, meeting compliance rules of the EU
• Need to provide extensive and rapidly growing data volumes online at reasonable costs
• Will implement a sync & share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

ETERNUS CD10000 – the new unified storage:
• Unlimited Scalability
• Immortal System / Zero Downtime
• TCO optimized

ETERNUS CD10000 combines open source storage with enterprise-class quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack?

Free open source (Apache license) software governed by a non-profit foundation (corporation) with a mission to produce the ubiquitous Open Source Cloud Computing platform that will meet the needs of public and private clouds regardless of size, by being simple to implement and massively scalable

• Massively scalable cloud operating system that controls large pools of compute, storage and networking resources
• Community OSS with contributions from 1,000+ developers and 180+ participating organizations
• Open web-based API; programmatic IaaS
• Plug-in architecture allows different hypervisors, block storage systems, network implementations; hardware agnostic, etc.

Foundation sponsors: Platinum, Gold, Corporate, ...
http://www.openstack.org/foundation/companies

20

Attained fast growing customer interest

• VMware clouds dominate
• OpenStack clouds are already #2
• Worldwide adoption

Source: OpenStack User Survey and Feedback, Nov 3rd 2014
Source: OpenStack User Survey and Feedback, May 13th 2014

21

Why are customers so interested?

• Greatest industry & community support compared to alternative open platforms (Eucalyptus, CloudStack, OpenNebula)
• "Ability to Innovate" jumped from #6 to #1

Source: OpenStack User Survey and Feedback, Nov 3rd 2014

22

OpenStack.org User Survey, Paris, Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

[Layer diagram, top to bottom:]
• Dashboard (Horizon), billing portal; OpenStack cloud APIs, EC2 API
• OpenStack services: Compute (Nova), Network (Neutron) + plugins, Volume (Cinder), Object (Swift), File (Manila), Images (Glance), Authentication (Keystone), Metering (Ceilometer)
• Hypervisor: KVM, ESXi, Hyper-V
• Fujitsu Open Cloud Storage: RADOS with Block (RBD), S3 (Rados-GW), File (CephFS)
• OAM: dhcp, deploy, LCM
• Base Operating System (CentOS)
• Physical Server (CPU, Memory, SSD, HDD) and Network

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data
• Content data for ICT services is ever-growing
• Demand for higher space efficiency and durability
• Reed-Solomon code (the de facto erasure code) improves both

Triple replication: content data + copy + copy = 3x space
Reed-Solomon code (old style): content data + parities = 1.5x space

However, Reed-Solomon code is not so recovery-efficient
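The space overheads on this slide follow directly from the layouts; a small sketch makes the arithmetic explicit (the erasure-code parameters are illustrative, chosen to reproduce the slide's 1.5x figure):

```python
def replication_overhead(copies: int) -> float:
    """Raw capacity needed per byte of content under n-way replication."""
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    """Raw capacity per byte of content for an erasure code with
    k data chunks and m parity chunks (e.g. Reed-Solomon)."""
    return (k + m) / k

print(replication_overhead(3))   # triple replication -> 3.0x space
print(erasure_overhead(4, 2))    # e.g. RS with k=4, m=2 -> 1.5x space
```

So with the same durability goals, an erasure code cuts raw capacity roughly in half compared to triple replication.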

26

Backgrounds (2)

Local parity improves recovery efficiency
• Data recovery should be as efficient as possible, in order to avoid multiple disk failures and data loss
• Reed-Solomon code was improved by local parity methods: the data read from disks is reduced during recovery

[Diagram: with plain Reed-Solomon code (no local parities), rebuilding a lost chunk reads many data and parity chunks; a local parity method reads only the chunks in the lost chunk's local group]

However, multiple disk failures are out of consideration in such a local parity method
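The read reduction can be estimated with a rough back-of-the-envelope sketch, using k=10 and l=5, the example values from the SHEC slide in this deck (a simplified model, not the exact read counts of any specific code):

```python
def recovery_reads_rs(k: int) -> int:
    """Chunks read to rebuild one lost chunk with plain Reed-Solomon:
    any k surviving chunks must be read and decoded."""
    return k

def recovery_reads_local(l: int) -> int:
    """Chunks read when a local parity covers the lost chunk: the other
    l - 1 members of its local group plus the local parity, i.e. l reads."""
    return l

# k=10 data chunks, local calculation range l=5:
print(recovery_reads_rs(10), recovery_reads_local(5))  # 10 vs 5 reads
```

Halving the reads per recovery shortens the rebuild window, which is exactly the window in which a second failure would cause data loss.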

27

Local parity method for multiple disk failures

Existing methods are optimized for a single disk failure
• e.g. Microsoft MS-LRC, Facebook Xorbas
However, their recovery overhead is large in case of multiple disk failures, because they then have a chance to use global parities for recovery

Our goal is a method that efficiently handles multiple disk failures

28

SHEC (= Shingled Erasure Code)

• An erasure code with only local parity groups, to improve recovery efficiency in case of multiple disk failures
• The calculation ranges of local parities are shifted and partly overlap with each other (like the shingles on a roof), to keep enough durability

Our proposal method (SHEC), example layout: k data chunks (=10), m parity chunks (=6), l calculation range (=5)
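The shingled layout above can be sketched for the slide's parameters (k=10, m=6, l=5). Note this is only an illustration of the shingling idea, not the exact layout algorithm of Ceph's SHEC plugin:

```python
# Illustrative shingled local-parity layout in the spirit of SHEC:
# m local parities, each computed over l consecutive data chunks, with the
# calculation ranges shifted evenly so they overlap like roof shingles.

def shingled_layout(k: int, m: int, l: int) -> list[range]:
    """Return, for each of the m local parities, the range of data-chunk
    indices it is computed over."""
    assert m > 1 and l <= k
    step = (k - l) / (m - 1)        # shift between neighbouring parities
    return [range(round(i * step), round(i * step) + l) for i in range(m)]

layout = shingled_layout(k=10, m=6, l=5)   # parameters from the slide
for i, r in enumerate(layout):
    print(f"parity {i} covers data chunks {list(r)}")

# every data chunk is covered by at least one local parity:
assert set().union(*map(set, layout)) == set(range(10))
```

With these parameters each parity covers 5 chunks and neighbouring ranges overlap in 4, so any single lost chunk can be rebuilt from one small local group, while the overlaps provide the redundancy needed for multiple failures.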

29

Summary mSHEC

1. mSHEC is more adjustable than Reed-Solomon code, because SHEC provides many recovery-efficient layouts, including Reed-Solomon codes
2. mSHEC's recovery time was ~20% faster than Reed-Solomon code in case of double disk failures
3. The mSHEC erasure code is included in the Hammer release
4. For more information, see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has adequate performance today, but there are performance issues which prevent us from taking full advantage of our hardware resources

Two main goals for improvement:
(1) decrease latency in the Ceph code path
(2) enhance large-cluster scalability with many nodes / OSDs

32

LTTng (general): http://lttng.org

General:
• open source tracing framework for Linux
• traces the Linux kernel and user space applications
• low overhead and therefore usable on production systems
• tracing can be activated at runtime
• Ceph code contains LTTng trace points already

Our LTTng-based profiling:
• activate within a function: collect timestamp information at the interesting places
• save the collected information in a single trace point at the end of the function
• transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points
• focused on primary and secondary write operations
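The transaction-profiling bookkeeping described above (collect timestamps per processing stage, correlate records by transaction id) is independent of LTTng itself; a minimal sketch of the idea, with invented stage names, might look like this:

```python
import time
from collections import defaultdict

# txid -> list of (stage, timestamp); the stage names below are invented
# examples, not actual Ceph trace point names.
trace_log: dict[int, list[tuple[str, float]]] = defaultdict(list)

def mark(txid: int, stage: str) -> None:
    """Record a timestamp for one processing stage of a transaction."""
    trace_log[txid].append((stage, time.perf_counter()))

def latencies(txid: int) -> dict[str, float]:
    """Per-stage latency: the time spent between consecutive marks."""
    points = trace_log[txid]
    return {b[0]: b[1] - a[1] for a, b in zip(points, points[1:])}

# simulated write transaction passing through several stages:
mark(42, "queued"); mark(42, "journal_write"); mark(42, "commit_sent")
print(latencies(42))
```

Because the key is the transaction id rather than the function, stages executed by different threads (or even different OSDs) can be stitched into one end-to-end latency breakdown.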

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation:
• replication write latency suffers from a large-variance problem
• minimum and average differ by a factor of 2
• this is a common problem, visible for many ceph-osd components

Why is the variance so large?
• Observation: no single hotspot visible
• Observation: the active processing steps do not differ between the minimum and the average sample as much as the total latency does
• additional latency penalty mostly at the switch from sub_op_modify_commit to Pipe::Writer
• no indication that queue length is the cause

Question: can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the Pipe::Writer thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3x nodes with 4 OSDs each: ~700 threads per node; 9x nodes with 40 OSDs each: >100k threads per node
ThreadPool::WorkThread is a hot spot (= work in the ObjectStore / FileStore)

Total CPU usage during test: 4317 CPU seconds

Thread class                           CPU seconds   Share
Pipe::Writer                                   459  10.63%
Pipe::Reader                                   581  13.45%
ShardedThreadPool::WorkThreadSharded           808  18.70%
ThreadPool::WorkThread                        1556  36.04%
FileJournal::Writer                            241   5.57%
FileJournal::WriteFinisher                     101   2.33%
Finisher::finisher_thread_entry                286   6.63%

36

FileStore benchmarking

Most of the work is done in FileStore::do_transactions
Each write transaction consists of:
• 3 calls to omap_setkeys
• the actual call to write to the file system
• 2 calls to setattr

Proposal: coalesce the calls to omap_setkeys: 1 function call instead of 3; set 5 key/value pairs instead of 6 (duplicate key)
The official change was to coalesce at the higher PG layer
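The effect of the proposal can be sketched independently of the FileStore code. Assuming three key/value batches per write transaction in which one key repeats (the key names below are invented for illustration), coalescing merges them into a single call with the duplicate collapsed:

```python
# Hypothetical key/value batches as they would be passed to three separate
# omap_setkeys calls; "_header" stands in for the duplicated key.
batches = [
    {"_header": b"v1", "snapset": b"..."},
    {"_header": b"v2"},                      # duplicate key, later value wins
    {"epoch": b"...", "info": b"...", "biginfo": b"..."},
]

def coalesce(batches):
    """Merge several omap_setkeys batches into one call's payload; for
    duplicate keys the later value overrides, as consecutive calls would."""
    merged = {}
    for b in batches:
        merged.update(b)
    return merged

merged = coalesce(batches)
print(len(merged))            # 5 key/value pairs instead of 6
print(merged["_header"])      # the last write wins
```

One backend call with 5 pairs replaces three calls with 6 pairs, saving both per-call overhead and one redundant key write.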

37

Other areas of investigation and improvement

• Lock analysis
  – RWLock instead of mutex
  – start with the CRC locks
• Bufferlist tuning
  – optimize for jumbo packets
• malloc issues

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready
• Unlimited scalability: 4 to 224 nodes, scales up to >50 petabytes
• Immortal system with zero downtime: non-disruptive add/remove/exchange of hardware (disks and nodes) or software updates
• TCO optimized: highly automated and fully integrated management reduces operational efforts
• Tight integration in OpenStack with its own GUI

Fujitsu will continue to enhance ease of use and performance
This is important: as Ceph's popularity increases, competitors will attack Ceph in these areas

39

pvonstamwitz@us.fujitsu.com

6

Fujitsu Maintenance Support and Professional Services

ETERNUS CD10000 A complete offer

Copyright 2015 FUJITSU

7

Unlimited Scalability

Cluster of storage nodes

Capacity and performance scales by

adding storage nodes

Three different node types enable

differentiated service levels

Density capacity optimized

Performance optimized

Optimized for small scale dev amp test

1st version of CD10000 (Q32014) is

released for a range of 4 to 224 nodes

Scales up to gt50 Petabyte

Copyright 2015 FUJITSU

Basic node 12 TB Performance node 35 TB Capacity node 252 TB

8

Immortal System

Copyright 2015 FUJITSU

Node1 Node2 Node(n)

+

Adding nodes

with new generation

of hardware

helliphelliphellip +

Adding nodes

New Node1

Non-disruptive add remove exchange of hardware (disks and nodes)

Mix of nodesdisks of different generations online technology refresh

Very long lifecycle reduces migration efforts and costs

9

TCO optimized

Based on x86 industry standard architectures

Based on open source software (Ceph)

High-availability and self-optimizing functions are part

of the design at no extra costs

Highly automated and fully integrated management

reduces operational efforts

Online maintenance and technology refresh reduce

costs of downtime dramatically

Extreme long lifecycle delivers investment protection

End-to-end design an maintenance from Fujitsu

reduces evaluation integration maintenance costs

Copyright 2015 FUJITSU

Better service levels at reduced costs ndash business centric storage

10

One storage ndash seamless management

ETERNUS CD10000 delivers one seamless

management for the complete stack

Central Ceph software deployment

Central storage node management

Central network management

Central log file management

Central cluster management

Central configuration administration and

maintenance

SNMP integration of all nodes and

network components

Copyright 2015 FUJITSU

11

Seamless management (2)

Dashboard ndash Overview of cluster status

Server Management ndash Management of cluster hardware ndash addremove server

(storage node) replace storage devices

Cluster Management ndash Management of cluster resources ndash cluster and pool creation

Monitoring the cluster ndash Monitoring overall capacity pool utilization status of OSD

Monitor and MDS processes Placement Group status and RBD status

Managing OpenStack Interoperation Connection to OpenStack Server and

placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI

12

13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS

Controller Level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli ltclustergt replace-disk-out

ltnodegt ltdevgt

exchange hard drive

vsm_cli ltclustergt replace-disk-in

ltnodegt ltdevgt

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal: coalesce calls to omap_setkeys

1 function call instead of 3 calls; set 5 key/value pairs instead of 6 (duplicate key)

The official change was to coalesce at the higher PG layer
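The proposed coalescing can be illustrated with a toy model. The key/value payloads below are invented for the example – the real change batches OSD omap updates inside a single FileStore transaction:

```python
def coalesce_omap_setkeys(calls):
    """Merge several omap_setkeys calls into one.

    `calls` is a list of dicts as passed to the individual calls;
    later calls win for duplicate keys, so 3 calls with 6 pairs in
    total collapse into 1 call with 5 pairs when one key repeats.
    """
    merged = {}
    for kv in calls:
        merged.update(kv)          # duplicate keys are overwritten
    return merged

# the three omap_setkeys calls of one write transaction (invented payloads)
calls = [
    {"_epoch": "12", "_info": "pg meta"},
    {"_biginfo": "past intervals", "snap_0": "..."},
    {"_epoch": "12", "log_0001": "entry"},   # "_epoch" repeats
]
merged = coalesce_omap_setkeys(calls)
```

One merged call saves two round trips through the ObjectStore interface per write transaction, which is exactly where the benchmarking above showed most of the work happening.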

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability: 4 to 224 nodes, scales up to >50 petabytes

Immortal System with Zero Downtime: non-disruptive add/remove/exchange of hardware (disks and nodes) or software updates

TCO optimized: highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Ceph's popularity increases, competitors will attack Ceph in these areas

39

pvonstamwitz@us.fujitsu.com

7

Unlimited Scalability

Cluster of storage nodes

Capacity and performance scales by adding storage nodes

Three different node types enable differentiated service levels

Density / capacity optimized

Performance optimized

Optimized for small scale dev & test

The 1st version of CD10000 (Q3/2014) is released for a range of 4 to 224 nodes

Scales up to >50 petabytes


Basic node: 12 TB | Performance node: 35 TB | Capacity node: 252 TB

8

Immortal System

[Diagram: Node1, Node2, … Node(n); the cluster grows by adding nodes over time, including nodes with new generations of hardware]

Non-disruptive add/remove/exchange of hardware (disks and nodes)

Mix of nodes/disks of different generations – online technology refresh

Very long lifecycle reduces migration efforts and costs

9

TCO optimized

Based on x86 industry standard architectures

Based on open source software (Ceph)

High-availability and self-optimizing functions are part of the design at no extra costs

Highly automated and fully integrated management reduces operational efforts

Online maintenance and technology refresh reduce costs of downtime dramatically

Extremely long lifecycle delivers investment protection

End-to-end design and maintenance from Fujitsu reduces evaluation, integration, and maintenance costs


Better service levels at reduced costs – business-centric storage

10

One storage – seamless management

ETERNUS CD10000 delivers one seamless management for the complete stack

Central Ceph software deployment

Central storage node management

Central network management

Central log file management

Central cluster management

Central configuration, administration, and maintenance

SNMP integration of all nodes and network components


11

Seamless management (2)

Dashboard – overview of cluster status

Server Management – management of cluster hardware: add/remove server (storage node), replace storage devices

Cluster Management – management of cluster resources: cluster and pool creation

Monitoring the cluster – monitoring overall capacity, pool utilization, status of OSD, Monitor, and MDS processes, Placement Group status, and RBD status

Managing OpenStack interoperation – connection to OpenStack server and placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI


13

Example: Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS / controller level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli <cluster> replace-disk-out <node> <dev>

exchange hard drive

vsm_cli <cluster> replace-disk-in <node> <dev>

14

Example: Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

• hardware will automatically PXE boot and install the current cluster environment, including the current configuration

Node automatically available to GUI

Add node to cluster with mouse click on GUI

• Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture enables the integration of apps

Fujitsu is working with customers and software vendors to integrate selected storage apps

E.g. archiving, sync & share, data discovery, cloud apps, …

[Diagram: ETERNUS CD10000 stack – object-, block-, and file-level access; central management; Ceph storage system SW and Fujitsu extensions; 10GbE frontend network; fast interconnect network; performance and capacity nodes; integrated apps: cloud services, sync & share, archive, iRODS data discovery]

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses the iRODS application for library services

iRODS is open-source data management software in use at research organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (e.g. research in chemistry and physics) for 10+ years, meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at reasonable costs

Will implement a sync & share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k – Key Values

[Diagram: ETERNUS CD10000, the new unified system – Unlimited Scalability, TCO optimized, Immortal System, Zero Downtime]

ETERNUS CD10000 combines open source storage with enterprise-class quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack?

Free open source (Apache license) software governed by a non-profit foundation (corporation) with a mission to produce the ubiquitous Open Source Cloud Computing platform that will meet the needs of public and private clouds regardless of size, by being simple to implement and massively scalable

[Foundation members: Platinum, Gold, Corporate, …]

Massively scalable cloud operating system that controls large pools of compute, storage, and networking resources

Community OSS with contributions from 1000+ developers and 180+ participating organizations

Open web-based API: programmatic IaaS

Plug-in architecture allows different hypervisors, block storage systems, network implementations; hardware agnostic, etc.

http://www.openstack.org/foundation/companies/

20

Attained fast-growing customer interest

VMware clouds dominate; OpenStack clouds already #2

Worldwide adoption

Source: OpenStack User Survey and Feedback, Nov 3rd 2014

Source: OpenStack User Survey and Feedback, May 13th 2014

21

Why are customers so interested?

Source: OpenStack User Survey and Feedback, Nov 3rd 2014

Greatest industry & community support compared to alternative open platforms (Eucalyptus, CloudStack, OpenNebula)

"Ability to Innovate" jumped from #6 to #1

22

OpenStack.org User Survey, Paris, Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

[Diagram: OpenStack and ETERNUS CD10000 layer stack]

– Physical server (CPU, memory, SSD, HDD) and network

– Base operating system (CentOS); OAM (dhcp, deploy, LCM)

– Hypervisor: KVM, ESXi, Hyper-V

– OpenStack: Compute (Nova), Network (Neutron) + plugins, Dashboard (Horizon), billing portal, cloud APIs / EC2 API, Volume (Cinder), Object (Swift), Authentication (Keystone), Images (Glance), Metering (Ceilometer), Manila (File)

– Ceph via RADOS: Block (RBD), S3 (RADOS-GW), File (CephFS)

– Fujitsu Open Cloud Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Triple Replication (old style) vs. Reed Solomon Code

However, Reed Solomon code is not so recovery-efficient

[Diagram: triple replication stores the content data plus two copies (3x space); Reed Solomon stores the content data plus parities (1.5x space)]
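The space figures above follow from simple arithmetic; a quick check, assuming a Reed Solomon layout of k = 10 data and m = 5 parity chunks for the 1.5x case (the slide does not give the exact k/m, so these values are illustrative):

```python
def space_factor_replication(copies):
    """n full copies of the data -> n-times the raw space."""
    return float(copies)

def space_factor_erasure(k, m):
    """k data chunks plus m parity chunks over k chunks of user data."""
    return (k + m) / k
```

Triple replication gives a factor of 3.0, while a 10+5 Reed Solomon layout gives 1.5, matching the slide.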

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

• in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

• data read from disks is reduced during recovery

[Diagram: with no local parities (Reed Solomon), recovery reads all data chunks; with local parities, the data read from disks is reduced]

However, multiple disk failures are out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods are optimized for single disk failure

• e.g. Microsoft MS-LRC, Facebook Xorbas

However, their recovery overhead is large in case of multiple disk failures

• because they have a chance to use global parities for recovery

Our Goal: a local parity method efficiently handling multiple disk failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

• to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each other (like the shingles on a roof)

• to keep enough durability

Our Proposed Method (SHEC)

k: data chunks (= 10), m: parity chunks (= 6), l: calculation range (= 5)
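The shingled layout can be modeled as overlapping windows. This is an illustrative sketch, not the exact mSHEC chunk placement; the window offsets are simply spread evenly so that m = 6 ranges of width l = 5 cover k = 10 data chunks:

```python
def shec_layout(k=10, m=6, l=5):
    """Assign each of the m parities a shifted calculation range of
    width l over the k data chunks. Adjacent ranges partly overlap,
    like the shingles on a roof."""
    layout = []
    for i in range(m):
        # spread the m windows evenly; the last window ends at chunk k-1
        start = round(i * (k - l) / (m - 1))
        layout.append(range(start, start + l))
    return layout

layout = shec_layout()
```

With these parameters each parity covers 5 consecutive chunks, every data chunk is covered by at least one local parity, and neighboring parity ranges overlap, which is what lets SHEC recover from multiple failures using only local reads.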

29

1. mSHEC is more adjustable than Reed Solomon code, because SHEC provides many recovery-efficient layouts, including Reed Solomon codes

2. mSHEC's recovery time was ~20% faster than Reed Solomon code in case of double disk failures

3. The mSHEC erasure code is included in the Hammer release

4. For more information see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

8

Immortal System

Copyright 2015 FUJITSU

Node1 Node2 Node(n)

+

Adding nodes

with new generation

of hardware

helliphelliphellip +

Adding nodes

New Node1

Non-disruptive add remove exchange of hardware (disks and nodes)

Mix of nodesdisks of different generations online technology refresh

Very long lifecycle reduces migration efforts and costs

9

TCO optimized

Based on x86 industry standard architectures

Based on open source software (Ceph)

High-availability and self-optimizing functions are part

of the design at no extra costs

Highly automated and fully integrated management

reduces operational efforts

Online maintenance and technology refresh reduce

costs of downtime dramatically

Extreme long lifecycle delivers investment protection

End-to-end design an maintenance from Fujitsu

reduces evaluation integration maintenance costs

Copyright 2015 FUJITSU

Better service levels at reduced costs ndash business centric storage

10

One storage ndash seamless management

ETERNUS CD10000 delivers one seamless

management for the complete stack

Central Ceph software deployment

Central storage node management

Central network management

Central log file management

Central cluster management

Central configuration administration and

maintenance

SNMP integration of all nodes and

network components

Copyright 2015 FUJITSU

11

Seamless management (2)

Dashboard ndash Overview of cluster status

Server Management ndash Management of cluster hardware ndash addremove server

(storage node) replace storage devices

Cluster Management ndash Management of cluster resources ndash cluster and pool creation

Monitoring the cluster ndash Monitoring overall capacity pool utilization status of OSD

Monitor and MDS processes Placement Group status and RBD status

Managing OpenStack Interoperation Connection to OpenStack Server and

placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI

12

13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS

Controller Level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli ltclustergt replace-disk-out

ltnodegt ltdevgt

exchange hard drive

vsm_cli ltclustergt replace-disk-in

ltnodegt ltdevgt

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

9

TCO optimized

Based on x86 industry standard architectures

Based on open source software (Ceph)

High-availability and self-optimizing functions are part

of the design at no extra costs

Highly automated and fully integrated management

reduces operational efforts

Online maintenance and technology refresh reduce

costs of downtime dramatically

Extreme long lifecycle delivers investment protection

End-to-end design an maintenance from Fujitsu

reduces evaluation integration maintenance costs

Copyright 2015 FUJITSU

Better service levels at reduced costs ndash business centric storage

10

One storage ndash seamless management

ETERNUS CD10000 delivers one seamless

management for the complete stack

Central Ceph software deployment

Central storage node management

Central network management

Central log file management

Central cluster management

Central configuration administration and

maintenance

SNMP integration of all nodes and

network components

Copyright 2015 FUJITSU

11

Seamless management (2)

Dashboard ndash Overview of cluster status

Server Management ndash Management of cluster hardware ndash addremove server

(storage node) replace storage devices

Cluster Management ndash Management of cluster resources ndash cluster and pool creation

Monitoring the cluster ndash Monitoring overall capacity pool utilization status of OSD

Monitor and MDS processes Placement Group status and RBD status

Managing OpenStack Interoperation Connection to OpenStack Server and

placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI

12

13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS

Controller Level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli ltclustergt replace-disk-out

ltnodegt ltdevgt

exchange hard drive

vsm_cli ltclustergt replace-disk-in

ltnodegt ltdevgt

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

[Layer diagram]

Physical server (CPU, memory, SSD, HDD) and network

Base operating system (CentOS); OAM (DHCP, deploy, LCM)

Hypervisor: KVM, ESXi, Hyper-V

OpenStack services: Compute (Nova), Network (Neutron) + plugins, Dashboard (Horizon), billing portal, cloud APIs, Object (Swift), Volume (Cinder), Authentication (Keystone), Images (Glance), EC2 API, Metering (Ceilometer), Manila (File)

RADOS backends: Block (RBD), S3 (Rados-GW), File (CephFS) – together the Fujitsu Open Cloud Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Triple Replication (old style): content data + copy + copy → 3x space

Reed Solomon Code: content data + parity + parity → 1.5x space

However, Reed Solomon code is not so recovery-efficient
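The space figures above follow from simple arithmetic; a minimal sketch (the Reed Solomon parameters k = 10, m = 5 are illustrative values chosen to reproduce the 1.5x figure, not taken from the slide):

```python
def overhead_replication(copies: int) -> float:
    """Raw-to-usable storage ratio for n-way replication."""
    return float(copies)

def overhead_reed_solomon(k: int, m: int) -> float:
    """Raw-to-usable storage ratio for Reed Solomon with
    k data chunks and m parity chunks."""
    return (k + m) / k

print(overhead_replication(3))        # triple replication: 3.0x raw space
print(overhead_reed_solomon(10, 5))   # e.g. k=10, m=5: 1.5x raw space
```

The same usable capacity thus needs half the raw disks under such a Reed Solomon layout, which is exactly the efficiency argument the slide makes.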

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

• in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

• data read from disks is reduced during recovery

[Figure: data read from disks during recovery – Reed Solomon code (no local parities) vs. a local parity method]

However, multiple disk failures are not considered

27

Local parity method for multiple disk failures

Existing methods are optimized for a single disk failure

• e.g. Microsoft MS-LRC, Facebook Xorbas

However, their recovery overhead is large in case of multiple disk failures

• because they have a chance to use global parities for recovery

Our goal is a method that efficiently handles multiple disk failures

28

SHEC (= Shingled Erasure Code)

An erasure code with only local parity groups

• to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each other (like the shingles on a roof)

• to keep enough durability

Layout example: k = 10 data chunks, m = 6 parity chunks, l = 5 (calculation range of each parity)
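A sketch of how such a shingled layout can be generated for the quoted parameters. The even spacing of the parity ranges below is an assumption made for illustration; the actual SHEC plugin computes its layout internally:

```python
def shec_layout(k: int, m: int, l: int):
    """For each of the m local parities, return the range of data
    chunks it covers: windows of length l, evenly shifted so that
    they overlap like roof shingles and together cover all k chunks."""
    step = (k - l) / (m - 1)  # shift between neighboring parity windows
    return [
        tuple(range(round(i * step), round(i * step) + l))
        for i in range(m)
    ]

layout = shec_layout(k=10, m=6, l=5)
# parity 0 covers data chunks 0..4, parity 1 covers 1..5, ..., parity 5 covers 5..9
space_overhead = (10 + 6) / 10  # 1.6x raw space for this k=10, m=6 layout
```

Because every data chunk sits in at least one short window, a failed chunk can be rebuilt from l chunks of one local group instead of reading k chunks as plain Reed Solomon would.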

29

1. mSHEC is more adjustable than Reed Solomon code, because SHEC provides many recovery-efficient layouts, including Reed Solomon codes

2. mSHEC's recovery time was ~20% faster than Reed Solomon code in case of double disk failures

3. The mSHEC erasure code is included in the Hammer release

4. For more information, see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes/OSDs

32

LTTng general (http://lttng.org)

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

within a function, collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points

focused on primary and secondary write operations
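The transaction-oriented profiling described above amounts to grouping trace points by transaction id and differencing consecutive timestamps; a minimal sketch with hypothetical event tuples and stage names (not LTTng's actual record format):

```python
from collections import defaultdict

def latency_per_stage(events):
    """events: iterable of (transaction_id, stage_name, timestamp_us).
    Returns, per transaction, a list of (stage_transition, latency_us)."""
    by_tx = defaultdict(list)
    for tx_id, stage, ts in events:
        by_tx[tx_id].append((ts, stage))
    profiles = {}
    for tx_id, points in by_tx.items():
        points.sort()  # order trace points of one transaction by timestamp
        profiles[tx_id] = [
            (f"{a_stage}->{b_stage}", b_ts - a_ts)
            for (a_ts, a_stage), (b_ts, b_stage) in zip(points, points[1:])
        ]
    return profiles

events = [
    (42, "queued_for_pg", 100), (42, "sub_op_modify", 180), (42, "commit", 420),
    (43, "queued_for_pg", 110), (43, "sub_op_modify", 205),
]
profiles = latency_per_stage(events)
# profiles[42] == [("queued_for_pg->sub_op_modify", 80), ("sub_op_modify->commit", 240)]
```

Correlating by transaction id rather than by function is what lets the latency of a single write IO be followed across threads and OSDs.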

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large?

Observation: No single hotspot visible

Observation: Active processing steps do not differ between the minimum and the average sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question: Can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the Pipewriter thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3 nodes with 4 OSDs each → ~700 threads per node; 9 nodes with 40 OSDs each → >100k threads per node

ThreadPool::WorkThread is a hot spot (= work in the ObjectStore / FileStore)

Total CPU usage during test: 4317 CPU seconds

Pipe::Writer: 459 s (10.63 %)

Pipe::Reader: 581 s (13.45 %)

ShardedThreadPool::WorkThreadSharded: 808 s (18.70 %)

ThreadPool::WorkThread: 1556 s (36.04 %)

FileJournal Writer: 241 s (5.57 %)

FileJournal WriteFinisher: 101 s (2.33 %)

Finisher::finisher_thread_entry: 286 s (6.63 %)
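Each per-class figure above is CPU seconds together with its share of the 4317-second total; the shares can be reproduced directly (thread-class names written with :: scope here are my reading of the flattened slide text):

```python
cpu_seconds = {
    "Pipe::Writer": 459,
    "Pipe::Reader": 581,
    "ShardedThreadPool::WorkThreadSharded": 808,
    "ThreadPool::WorkThread": 1556,
    "FileJournal Writer": 241,
    "FileJournal WriteFinisher": 101,
    "Finisher::finisher_thread_entry": 286,
}
total = 4317  # total CPU seconds consumed during the test
share = {name: round(100 * sec / total, 2) for name, sec in cpu_seconds.items()}
# share["ThreadPool::WorkThread"] -> 36.04, the FileStore hot spot
```

The listed classes account for 4032 of the 4317 seconds; the remainder is spread over other, smaller thread classes.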

36

FileStore benchmarking

most of the work is done in FileStore::do_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal: coalesce calls to omap_setkeys

1 function call instead of 3 calls; set 5 key/value pairs instead of 6 (duplicate key)

The official change was to coalesce at the higher PG layer
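The effect of the proposed coalescing can be sketched as a plain dict merge; the key names below are made up for illustration and merely stand in for the PG/object metadata keys the write transaction actually sets:

```python
def coalesce_omap_setkeys(calls):
    """Merge several omap_setkeys payloads into one call; a duplicate
    key keeps only its last value, so the merged call is never larger."""
    merged = {}
    for kv in calls:
        merged.update(kv)  # later writes of the same key win
    return merged

calls = [                      # 3 separate omap_setkeys calls, 6 pairs total
    {"pg_info": b"v1", "pg_log_head": b"h"},
    {"pg_info": b"v2"},        # duplicate key: supersedes the first write
    {"obj_snaps": b"s", "obj_ver": b"7", "obj_size": b"4096"},
]
merged = coalesce_omap_setkeys(calls)
# 1 call, 5 key/value pairs, and merged["pg_info"] == b"v2"
```

Besides saving two function calls per transaction, dropping the duplicate key also saves one key/value write in the backend, which is why 6 pairs shrink to 5.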

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability: 4 to 224 nodes, scales up to >50 Petabyte

Immortal System with Zero Downtime: non-disruptive add/remove/exchange of hardware (disks and nodes) or software updates

TCO optimized: highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Ceph's popularity increases, competitors will attack Ceph in these areas

39 Copyright 2015 FUJITSU

pvonstamwitz@us.fujitsu.com

10

One storage ndash seamless management

ETERNUS CD10000 delivers one seamless

management for the complete stack

Central Ceph software deployment

Central storage node management

Central network management

Central log file management

Central cluster management

Central configuration administration and

maintenance

SNMP integration of all nodes and

network components

Copyright 2015 FUJITSU

11

Seamless management (2)

Dashboard ndash Overview of cluster status

Server Management ndash Management of cluster hardware ndash addremove server

(storage node) replace storage devices

Cluster Management ndash Management of cluster resources ndash cluster and pool creation

Monitoring the cluster ndash Monitoring overall capacity pool utilization status of OSD

Monitor and MDS processes Placement Group status and RBD status

Managing OpenStack Interoperation Connection to OpenStack Server and

placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI

12

13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS

Controller Level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli ltclustergt replace-disk-out

ltnodegt ltdevgt

exchange hard drive

vsm_cli ltclustergt replace-disk-in

ltnodegt ltdevgt

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

11

Seamless management (2)

Dashboard ndash Overview of cluster status

Server Management ndash Management of cluster hardware ndash addremove server

(storage node) replace storage devices

Cluster Management ndash Management of cluster resources ndash cluster and pool creation

Monitoring the cluster ndash Monitoring overall capacity pool utilization status of OSD

Monitor and MDS processes Placement Group status and RBD status

Managing OpenStack Interoperation Connection to OpenStack Server and

placement of pools in Cinder multi-backend

12

Optional use of Calamari Management GUI

12

13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS

Controller Level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli ltclustergt replace-disk-out

ltnodegt ltdevgt

exchange hard drive

vsm_cli ltclustergt replace-disk-in

ltnodegt ltdevgt

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

12

Optional use of Calamari Management GUI

12

13

Example Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS

Controller Level

identify (right) hard drive in server

exchanging hard drive

partitioning hard drive on OS level

Make and mount file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli ltclustergt replace-disk-out

ltnodegt ltdevgt

exchange hard drive

vsm_cli ltclustergt replace-disk-in

ltnodegt ltdevgt

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are customers so interested?

Source: OpenStack User Survey and Feedback, Nov 3rd 2014

Greatest industry & community support compared to alternative open platforms (Eucalyptus, CloudStack, OpenNebula)

"Ability to Innovate" jumped from #6 to #1

22

OpenStack.org User Survey, Paris, Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

[Diagram: layered architecture – Physical Server (CPU, Memory, SSD, HDD) and Network; Base Operating System (CentOS) with OAM (dhcp, deploy, LCM); Hypervisor (KVM, ESXi, Hyper-V); OpenStack services: Compute (Nova), Network (Neutron) + plugins, Dashboard (Horizon), Billing Portal, Cloud APIs, Authentication (Keystone), Images (Glance), EC2 API, Metering (Ceilometer), Volume (Cinder), Object (Swift), Manila (File); backed by RADOS – Block (RBD), S3 (Rados-GW), File (CephFS) – as Fujitsu Open Cloud Storage]

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Background (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

However, Reed Solomon code is not very recovery-efficient

[Diagram: Triple Replication – content data + 2 copies = 3x space vs. Reed Solomon Code (old style) – content data + 2 parities = 1.5x space]
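The space overheads in the diagram follow from simple arithmetic: n-way replication costs n times the content size, while an erasure code with k data and m parity chunks costs (k+m)/k. A minimal sketch; the k=10, m=5 layout is an assumption chosen only to reproduce the 1.5x figure from the slide:

```python
def replication_overhead(copies: int) -> float:
    """Total space used per unit of content data with n-way replication."""
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    """Total space used per unit of content data with k data + m parity chunks."""
    return (k + m) / k

# Triple replication: 3 full copies -> 3x space
assert replication_overhead(3) == 3.0
# Reed Solomon with e.g. k=10 data and m=5 parity chunks -> 1.5x space
assert erasure_overhead(10, 5) == 1.5
```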

26

Background (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

• in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

• data read from disks is reduced during recovery

[Diagram: data chunks, parity chunks, and the data read from disks during recovery – Reed Solomon code (no local parities) vs. a local parity method]

However, multiple disk failures are not considered

27

Local parity method for multiple disk failures

Existing methods are optimized for a single disk failure

• e.g. Microsoft's MS-LRC, Facebook's Xorbas

However, their recovery overhead is large in case of multiple disk failures

• because they may have to use global parities for recovery

Our Goal

Our goal is a method that efficiently handles multiple disk failures

[Diagram: a local parity method handling multiple disk failures]

28

SHEC (= Shingled Erasure Code)

An erasure code with only local parity groups

• to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each other (like the shingles on a roof)

• to keep enough durability

[Diagram: our proposed method (SHEC) – k data chunks (=10), m parity chunks (=6), l calculation range (=5)]
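The shingle layout can be sketched as follows: each of the m local parities covers l consecutive data chunks, and the ranges are shifted so that neighbours partly overlap. The even-spacing step below is an assumption for illustration, not the exact SHEC placement rule:

```python
def shec_ranges(k: int, m: int, l: int) -> list[range]:
    """Calculation ranges of m shingled local parities over k data chunks.

    Each parity covers l consecutive data chunks; ranges are shifted by a
    fixed step so neighbours partly overlap (like shingles on a roof).
    """
    if m == 1:
        return [range(0, l)]
    step = (k - l) / (m - 1)  # assumed even spacing across the stripe
    return [range(round(i * step), round(i * step) + l) for i in range(m)]

ranges = shec_ranges(k=10, m=6, l=5)
# every data chunk is covered by at least one local parity
covered = set()
for r in ranges:
    covered.update(r)
assert covered == set(range(10))
# neighbouring parities overlap like shingles
assert all(set(a) & set(b) for a, b in zip(ranges, ranges[1:]))
```

With k=10, m=6, l=5 this yields ranges 0–4, 1–5, …, 5–9: full coverage of the stripe with four chunks of overlap between neighbouring parities.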

29

Summary mSHEC

1. mSHEC is more adjustable than Reed Solomon code, because SHEC provides many recovery-efficient layouts, including Reed Solomon codes

2. mSHEC's recovery time was ~20% faster than Reed Solomon code in case of double disk failures

3. The mSHEC erasure code is included in the Hammer release

4. For more information, see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has adequate performance today

But there are performance issues which prevent us from taking full advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes / OSDs

32

LTTng general (http://lttng.org)

General

open source tracing framework for Linux

traces Linux kernel and user space applications

low overhead and therefore usable on production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

within a function, collect timestamp information at the interesting places

save the collected information in a single trace point at the end of the function

transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points

focused on primary and secondary write operations
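The transaction-oriented profiling described above boils down to grouping trace points by Ceph transaction id and taking first-to-last timestamps per transaction. A minimal sketch of that correlation step; the trace-point names and timestamps are invented for illustration:

```python
from collections import defaultdict

# Simplified trace records: (transaction_id, trace_point, timestamp_us).
events = [
    (1, "op_received",  100), (2, "op_received",  120),
    (1, "journal_done", 350), (2, "journal_done", 500),
    (1, "commit_sent",  420), (2, "commit_sent",  900),
]

def turnaround_times(events):
    """Group trace points by transaction id and compute first-to-last latency."""
    by_txid = defaultdict(list)
    for txid, _point, ts in events:
        by_txid[txid].append(ts)
    return {txid: max(ts) - min(ts) for txid, ts in by_txid.items()}

lat = turnaround_times(events)
assert lat == {1: 320, 2: 780}
```

Grouping by transaction id rather than by function is what exposes the per-write turnaround time, including gaps between components, rather than just the time spent inside each function.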

33

Turnaround Time of a single Write IO

34

LTTng data evaluation: Replication Write

Observation:

replication write latency suffers from a large variance problem

minimum and average differ by a factor of 2

This is a common problem, visible for many ceph-osd components

Why is the variance so large?

Observation: no single hotspot visible

Observation: active processing steps do not differ between the minimum and average sample as much as the total latency does

Additional latency penalty mostly at the switch from sub_op_modify_commit to PipeWriter

no indication that queue length is the cause

Question: Can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the PipeWriter thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3x nodes with 4 OSDs each → ~700 threads per node; 9x nodes with 40 OSDs each → >100k threads per node

ThreadPool::WorkThread is a hot spot = work in the ObjectStore / FileStore

Total CPU usage during test: 4317 CPU seconds

PipeWriter: 459 s (10.63 %)

PipeReader: 581 s (13.45 %)

ShardedThreadPool::WorkThreadSharded: 808 s (18.70 %)

ThreadPool::WorkThread: 1556 s (36.04 %)

FileJournal::Writer: 241 s (5.57 %)

FileJournal::WriteFinisher: 101 s (2.33 %)

Finisher::finisher_thread_entry: 286 s (6.63 %)

36

FileStore benchmarking

most of the work is done in FileStore::do_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal: coalesce the calls to omap_setkeys

1 function call instead of 3; set 5 key/value pairs instead of 6 (duplicate key)

The official change was to coalesce at the higher PG layer
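The coalescing idea can be illustrated with a small sketch: merge the argument maps of the three omap_setkeys calls into one batched call, where a key written twice collapses to a single last-writer-wins pair (6 pairs across 3 calls become 5 pairs in 1 call). The key names below are hypothetical, not the actual keys Ceph writes:

```python
# Hypothetical per-transaction omap_setkeys calls (key names invented).
calls = [
    {"_info": b"v1", "snapset": b"a"},              # call 1: 2 pairs
    {"_info": b"v2"},                               # call 2: 1 pair (duplicate key)
    {"epoch": b"7", "log": b"x", "biginfo": b"y"},  # call 3: 3 pairs
]

def coalesce(calls):
    """Merge several omap_setkeys argument dicts into one batched call."""
    merged = {}
    for kv in calls:
        merged.update(kv)  # later duplicates overwrite earlier values
    return merged

batched = coalesce(calls)
assert sum(len(c) for c in calls) == 6   # 6 pairs across 3 calls...
assert len(batched) == 5                 # ...become 5 pairs in a single call
assert batched["_info"] == b"v2"         # duplicate key: last value wins
```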

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability: 4 to 224 nodes, scales up to >50 Petabyte

Immortal System with Zero Downtime: non-disruptive add / remove / exchange of hardware (disks and nodes) or software update

TCO optimized: highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Ceph's popularity increases, competitors will attack Ceph in these areas

39 Copyright 2015 FUJITSU

pvonstamwitz@us.fujitsu.com

13

Example: Replacing an HDD

Plain Ceph

taking the failed disk offline in Ceph

taking the failed disk offline on OS / controller level

identify the (right) hard drive in the server

exchanging the hard drive

partitioning the hard drive on OS level

make and mount the file system

bring the disk up in Ceph again

On ETERNUS CD10000

vsm_cli <cluster> replace-disk-out <node> <dev>

exchange hard drive

vsm_cli <cluster> replace-disk-in <node> <dev>

14

Example: Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

• hardware will automatically PXE boot and install the current cluster environment, including the current configuration

Node automatically available to GUI

Add node to cluster with a mouse click on the GUI

• Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture enables the integration of apps

Fujitsu is working with customers and software vendors to integrate selected storage apps

E.g. archiving, sync & share, data discovery, cloud apps…

Copyright 2015 FUJITSU

[Diagram: Cloud Services, Sync & Share, Archive, and iRODS data discovery apps on top of ETERNUS CD10000 – Object / Block / File Level Access, Central Management, Ceph Storage System SW and Fujitsu Extensions, 10GbE Frontend Network, Fast Interconnect Network, Performance Nodes and Capacity Nodes]

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

14

Example Adding a Node

Plain Ceph

Install hardware

Install OS

Configure OS

Partition disks (OSDs Journals)

Make filesystems

Configure network

Configure ssh

Configure Ceph

Add node to cluster

On ETERNUS CD10000

Install hardware

bull hardware will automatically PXE boot

and install the current cluster

environment including current

configuration

Node automatically available to GUI

Add node to cluster with mouse click

on GUI

bull Automatic PG adjustment if needed

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

15

Adding and Integrating Apps

The ETERNUS CD10000 architecture

enables the integration of apps

Fujitsu is working with customers and

software vendors to integrate selected

storage apps

Eg archiving sync amp share data

discovery cloud appshellip

Copyright 2015 FUJITSU

Cloud

Services

Sync

amp Share Archive

iRODS data

discovery

ET

ER

NU

S C

D1

00

00

Object Level

Access

Block Level

Access

File Level

Access

Central Management

Ceph Storage System SW and Fujitsu

Extensions

10GbE Frontend Network

Fast Interconnect Network

Perf

orm

ance N

odes

Capacity N

odes

16

ETERNUS CD10000 at University Mainz

Large university in Germany

Uses iRODS Application for library services

iRODS is an open-source data management software in use at research

organizations and government agencies worldwide

Organizes and manages large depots of distributed digital data

Customer has built an interface from iRODS to Ceph

Stores raw data of measurement instruments (eg research in chemistry and

physics) for 10+ years meeting compliance rules of the EU

Need to provide extensive and rapidly growing data volumes online at

reasonable costs

Will implement a sync amp share service on top of ETERNUS CD10000

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code with only local parity groups

• to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

• to keep enough durability

Our proposed method (SHEC): k = 10 data chunks, m = 6 parity chunks, l = 5 calculation range
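The shingled overlap can be illustrated with a small sketch; the even shift step below is an assumption chosen so the m ranges overlap like roof shingles, not the exact SHEC definition:

```python
def shec_layout(k, m, l):
    """Return, per parity chunk, the range of data chunks it covers.

    The shift step (k - l) / (m - 1) is an illustrative assumption so that
    the m calculation ranges slide evenly across the k data chunks.
    """
    step = (k - l) / (m - 1)
    return [range(round(j * step), round(j * step) + l) for j in range(m)]

for j, rng in enumerate(shec_layout(10, 6, 5)):
    print(f"parity {j}: data chunks {rng.start}..{rng.stop - 1}")
```

With k = 10, m = 6, l = 5 the six local ranges are 0–4, 1–5, …, 5–9: each parity overlaps its neighbours, like the slide's shingles.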

29

1. mSHEC is more adjustable than Reed-Solomon code, because SHEC provides many recovery-efficient layouts, including Reed-Solomon codes

2. mSHEC's recovery time was ~20% faster than Reed-Solomon code in case of double disk failures

3. The mSHEC erasure code is included in the Hammer release

4. For more information, see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement:

(1) Decrease latency in the Ceph code path

(2) Enhance large-cluster scalability with many nodes / OSDs

32

LTTng general (http://lttng.org)

General

open source tracing framework for Linux

traces Linux kernel and user-space applications

low overhead, and therefore usable on production systems

tracing can be activated at runtime

Ceph code already contains LTTng trace points

Our LTTng-based profiling

within a function, collect timestamp information at the interesting places

save the collected information in a single trace point at the end of the function

transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points

focused on primary and secondary write operations
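The transaction-oriented correlation described above can be sketched as follows (the event tuples and field names are hypothetical, not actual Ceph trace-point payloads):

```python
from collections import defaultdict

def latency_per_transaction(events):
    """events: iterable of (transaction_id, timestamp_us) pairs."""
    spans = defaultdict(list)
    for txid, ts in events:
        spans[txid].append(ts)
    # per transaction: last trace point minus first trace point
    return {txid: max(ts) - min(ts) for txid, ts in spans.items()}

trace = [(1, 100), (2, 120), (1, 340), (2, 500), (1, 400)]
print(latency_per_transaction(trace))  # {1: 300, 2: 380}
```

Grouping by transaction id rather than by function is what lets the per-write turnaround time be reconstructed across threads.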

33

Turn Around Time of a single Write IO

34

LTTng data evaluation: Replication Write

Observation

replication write latency suffers from a large variance problem

minimum and average differ by a factor of 2

This is a common problem, visible for many ceph-osd components

Why is variance so large?

Observation: no single hotspot visible

Observation: active processing steps do not differ between the minimum and the average sample as much as the total latency does

Additional latency penalty mostly at the switch from sub_op_modify_commit to PipeWriter

no indication that queue length is the cause

Question: can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the PipeWriter thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3 nodes with 4 OSDs each → ~700 threads per node; 9 nodes with 40 OSDs each → >100k threads per node

ThreadPool::WorkThread is a hot spot (= work in the ObjectStore / FileStore)

Total CPU usage during the test: 4317 CPU seconds

PipeWriter: 459 s (10.63 %)

PipeReader: 581 s (13.45 %)

ShardedThreadPool::WorkThreadSharded: 808 s (18.70 %)

ThreadPool::WorkThread: 1556 s (36.04 %)

FileJournal::Writer: 241 s (5.57 %)

FileJournal::WriteFinisher: 101 s (2.33 %)

Finisher::finisher_thread_entry: 286 s (6.63 %)

36

FileStore benchmarking

most of the work is done in FileStore::do_transactions

each write transaction consists of:

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal: coalesce calls to omap_setkeys

1 function call instead of 3; set 5 key/value pairs instead of 6 (one duplicate key)

Official change was to coalesce at the higher PG layer
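A minimal sketch of the proposed coalescing, assuming dict-like key/value batches (the key names below are hypothetical, not FileStore's actual omap keys):

```python
def coalesce_omap_setkeys(calls):
    """Merge several omap_setkeys batches into one; later duplicates win."""
    merged = {}
    for kv in calls:
        merged.update(kv)
    return merged

# Three calls, six pairs in total, one duplicate key ("_info"):
calls = [{"_epoch": "1", "_info": "a"},
         {"snapset": "s", "_info": "b"},
         {"head": "h", "biginfo": "x"}]
print(len(coalesce_omap_setkeys(calls)))  # 5 pairs instead of 6
```

One merged call saves both the per-call overhead and the redundant write of the duplicated key.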

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited scalability: 4 to 224 nodes, scales up to >50 petabytes

Immortal system with zero downtime: non-disruptive add / remove / exchange of hardware (disks and nodes) or software updates

TCO optimized: highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with its own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important: as Ceph's popularity increases, competitors will attack Ceph in these areas

39 Copyright 2015 FUJITSU

pvonstamwitz@us.fujitsu.com

17

Summary ETERNUS CD10k ndash Key Values

Copyright 2015 FUJITSU

ETERNUS CD10000

ETERNUS

CD10000

Unlimited

Scalability

TCO

optimized

The new

unified

Immortal

System

Zero

Downtime

ETERNUS CD10000 combines open source storage with enterprisendashclass quality of service

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

18

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

19

What is OpenStack

Free open source (Apache license) software governed by a non-profit foundation

(corporation) with a mission to produce the ubiquitous Open Source Cloud

Computing platform that will meet the needs of public and private clouds

regardless of size by being simple to implement and massively scalable

Platin

Gold

Corporate

hellip

hellip

Massively scalable cloud operating system that

controls large pools of compute storage and

networking resources

Community OSS with contributions from 1000+

developers and 180+ participating organizations

Open web-based API Programmatic IaaS

Plug-in architecture allows different hypervisors

block storage systems network implementations

hardware agnostic etc

httpwwwopenstackorgfoundationcompanies

20

Attained fast growing customer interest

VMware clouds dominate

OpenStack clouds already 2

Worldwide adoption

Source OpenStack User Survey and Feedback Nov 3rd 2014

Source OpenStack User Survey and Feedback May 13th 2014

21

Why are Customers so interested

Source OpenStack User Survey and Feedback Nov 3rd 2014

Greatest industry amp community support

compared to alternative open platforms

Eucalyptus CloudStack OpenNebula

ldquoAbility to Innovaterdquo jumped from 6 to 1

22

OpenStackorg User Survey Paris Nov 2014

23

OpenStack Cloud Layers

OpenStack and ETERNUS CD10000

Physical Server (CPU Memory SSD HDD) and Network

Base Operating System (CentOS)

OAM

-dhcp

-Deploy

-LCM

Hypervisor

KVM ESXi

Hyper-V

Compute (Nova)

Network

(Neutron) +

plugins

Dashboard (Horizon)

Billing Portal

OpenStack

Cloud APIs

RADOS

Block

(RBD)

S3

(Rados-GW)

Object (Swift) Volume (Cinder)

Authentication (Keystone)

Images (Glance)

EC2 API

Metering (Ceilometer)

Manila (File)

File

(CephFS)

Fujitsu

Open Cloud

Storage

24

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods are optimized for single-disk failure

• e.g., Microsoft MS-LRC, Facebook Xorbas

However, their recovery overhead is large in the case of multiple disk failures

• because they may have to use global parities for recovery

Our goal is a method that efficiently handles multiple disk failures.

[Diagram] A local parity method under multiple disk failures.

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

• to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each other (like the shingles on a roof)

• to keep enough durability

[Diagram] Our proposed method (SHEC): k = 10 data chunks, m = 6 parity chunks, l = 5 (calculation range of each parity).
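A sketch of one plausible shingled layout for the k=10, m=6, l=5 example above. The even shift of (k - l) / (m - 1) between neighbouring windows is an assumption; the slide only says the ranges are shifted and partly overlap.

```python
def shec_layout(k=10, m=6, l=5):
    """One plausible SHEC-style layout: m local parities, each covering a
    window of l data chunks, windows shifted evenly like roof shingles."""
    step = (k - l) / (m - 1)  # assumed even shift between neighbouring windows
    return [tuple(range(round(i * step), round(i * step) + l)) for i in range(m)]

layout = shec_layout()
# neighbouring windows overlap ...
assert all(set(a) & set(b) for a, b in zip(layout, layout[1:]))
# ... and together the windows cover every data chunk
assert set().union(*(set(w) for w in layout)) == set(range(10))
```

With these parameters the shift works out to one chunk per parity, so parity 0 covers chunks 0-4 and parity 5 covers chunks 5-9.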

29

1. mSHEC is more adjustable than Reed-Solomon code, because SHEC provides many recovery-efficient layouts, including Reed-Solomon codes.

2. mSHEC's recovery time was ~20% faster than Reed-Solomon code in the case of double disk failures.

3. The mSHEC erasure code is included in the Hammer release.

4. For more information, see https://wiki.ceph.com/Planning/Blueprints/Hammer/Shingled_Erasure_Code_(SHEC) or ask Fujitsu.

Summary: mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has adequate performance today.

But there are performance issues which prevent us from taking full advantage of our hardware resources.

Two main goals for improvement:

(1) Decrease latency in the Ceph code path

(2) Enhance large-cluster scalability with many nodes / OSDs

32

LTTng (general): http://lttng.org

General

• open-source tracing framework for Linux

• traces the Linux kernel and user-space applications

• low overhead and therefore usable on production systems

• tracing can be activated at runtime

• the Ceph code already contains LTTng trace points

Our LTTng-based profiling

• within a function, collect timestamp information at the interesting places

• save the collected information in a single trace point at the end of the function

• transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points

• focused on primary and secondary (replication) write operations
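The profiling approach above can be sketched in a few lines. This is a hypothetical illustration, not the actual Ceph/LTTng code: timestamps gathered at interesting places are kept per transaction id and written out as one consolidated record at the end, i.e. one trace point per function.

```python
import time
from collections import defaultdict

# per-transaction timestamp buffers (txn_id -> [(label, t), ...])
_marks = defaultdict(list)

def mark(txn_id, label):
    """Record a timestamp for one step of a transaction."""
    _marks[txn_id].append((label, time.monotonic()))

def flush(txn_id):
    """Emit a single record: per-step offsets from the first timestamp."""
    marks = _marks.pop(txn_id)
    t0 = marks[0][1]
    return {label: t - t0 for label, t in marks}

mark(42, "enter"); mark(42, "journal_write"); mark(42, "commit")
record = flush(42)   # one record correlating all steps of transaction 42
```

Correlating by transaction id is what lets write operations be followed across threads and OSDs, which per-function profiling cannot do.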

33

Turnaround Time of a Single Write IO

34

LTTng data evaluation: replication write

Observation:

Replication write latency suffers from a large-variance problem: minimum and average differ by a factor of 2. This is a common problem, visible for many ceph-osd components.

Why is the variance so large?

• Observation: no single hotspot visible.

• Observation: the active processing steps do not differ between the minimum and the average sample as much as the total latency does.

• The additional latency penalty occurs mostly at the switch from sub_op_modify_commit to PipeWriter; there is no indication that queue length is the cause.

Question: can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the PipeWriter thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3 nodes with 4 OSDs each = ~700 threads per node; 9 nodes with 40 OSDs each = >100k threads per node.

ThreadPool::WorkThread is a hot spot (= work in the ObjectStore / FileStore).

Total CPU usage during the test: 4317 CPU seconds.

Thread class                              CPU seconds   Share
PipeWriter                                        459   10.63 %
PipeReader                                        581   13.45 %
ShardedThreadPool::WorkThreadSharded              808   18.70 %
ThreadPool::WorkThread                           1556   36.04 %
FileJournal::Writer                               241    5.57 %
FileJournal::WriteFinisher                        101    2.33 %
Finisher::finisher_thread_entry                   286    6.63 %

36

FileStore benchmarking

Most of the work is done in FileStore::do_transactions.

Each write transaction consists of:

• 3 calls to omap_setkeys

• the actual call to write to the file system

• 2 calls to setattr

Proposal: coalesce the calls to omap_setkeys: 1 function call instead of 3; set 5 key-value pairs instead of 6 (one duplicate key).

The official change was to coalesce at the higher PG layer instead.
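A minimal sketch of the coalescing idea (the key names below are made up for illustration; the real FileStore omap keys differ): merging the three payloads applies later duplicates over earlier ones, so 6 pairs collapse to 5 and one function call replaces three.

```python
def coalesce_omap_setkeys(calls):
    """Merge several omap_setkeys payloads into one; a later value for a
    duplicate key wins, as if the calls had been applied in order."""
    merged = {}
    for kv in calls:
        merged.update(kv)
    return merged

# illustrative payloads: 3 calls, 6 pairs in total, one duplicate key ("_")
calls = [
    {"_": b"a", "snapset": b"s"},
    {"_": b"b"},
    {"hinfo": b"h", "x": b"1", "y": b"2"},
]
merged = coalesce_omap_setkeys(calls)  # 5 pairs, single call
```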

37

Other areas of investigation and improvement

• Lock analysis: RWLock instead of mutex; start with CRC locks

• Bufferlist tuning: optimize for jumbo packets

• malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready:

• Unlimited scalability: 4 to 224 nodes, scales up to >50 petabytes

• Immortal system with zero downtime: non-disruptive add / remove / exchange of hardware (disks and nodes) and software updates

• TCO-optimized: highly automated and fully integrated management reduces operational effort

• Tight integration into OpenStack with its own GUI

Fujitsu will continue to enhance ease of use and performance. This is important: as Ceph's popularity increases, competitors will attack Ceph in these areas.

39 Copyright 2015 FUJITSU

pvonstamwitz@us.fujitsu.com



correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

25

Backgrounds (1)

Erasure codes for content data

Content data for ICT services is ever-growing

Demand for higher space efficiency and durability

Reed Solomon code (de facto erasure code) improves both

Reed Solomon Code (Old style)Triple Replication

However Reed Solomon code is not so recovery-efficient

content data

copy copy

3x space

parity parity

15x space

content data

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

26

Backgrounds (2)

Local parity improves recovery efficiency

Data recovery should be as efficient as possible

bull in order to avoid multiple disk failures and data loss

Reed Solomon code was improved by local parity methods

bull data read from disks is reduced during recovery

Data Chunks

Parity Chunks

Reed Solomon Code

(No Local Parities) Local Parities

data read from disks

However multiple disk failures is out of consideration

A Local Parity Method

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

27

Local parity method for multiple disk failures

Existing methods is optimized for single disk failure

bull eg Microsoft MS-LRC Facebook Xorbas

However Its recovery overhead is large in case of multiple disk failures

bull because they have a chance to use global parities for recovery

Our Goal

A Local Parity Method

Our goal is a method efficiently handling multiple disk failures

Multiple Disk Failures

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

28

SHEC (= Shingled Erasure Code)

An erasure code only with local parity groups

bull to improve recovery efficiency in case of multiple disk failures

The calculation ranges of local parities are shifted and partly overlap with each

other (like the shingles on a roof)

bull to keep enough durability

Our Proposal Method (SHEC)

k data chunks (=10)

m

parity

chunks

(=6)

l calculation range (=5)

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

29

1 mSHEC is more adjustable than Reed Solomon code

because SHEC provides many recovery-efficient layouts

including Reed Solomon codes

2 mSHECrsquos recovery time was ~20 faster than Reed

Solomon code in case of double disk failures

3 mSHEC erasure-code included in Hammer release

4 For more information see httpswikicephcomPlanningBlueprintsHammerShingled_Erasure_Code_(SHEC)

or ask Fujitsu

Summary mSHEC

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation

replication write latency suffers from the large variance problem

minimum and average differ by a factor of 2

This is a common problem visible for many ceph-osd components

Why is variance so large

Observation No single hotspot visible

Observation Active processing steps do not differ between minimum and average

sample as much as the total latency does

Additional latency penalty mostly at the switch from

sub_op_modify_commit to Pipewriter

no indication that queue length is the cause

Question Can the overall thread load on the system and Linux scheduling be the

reason for the delayed start of the Pipewriter thread

35

Thread classes and ceph-osd CPU usage

Thread per ceph-osd depends on complexity of Ceph cluster 3x node with 4 OSDs

each ~700 threads per node 9x nodes with 40 OSDs each gt 100k threads per node

ThreadPoolWorkThread is a hot spot = work in the ObjectStore FileStore

total CPU usage during test 4317 CPU seconds

PipeWriter 459 1063

PipeReader 581 1345

ShardedThreadPoolWorkThreadSharded 808 1870

ThreadPoolWorkThread 1556 3604

FileJournalWriter 241 557

FileJournalWriteFinisher 101 233

Finisherfinisher_thread_entry 286 663

36

FileStore benchmarking

most of the work is done in FileStoredo_transactions

each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal coalesce calls to omap_setkeys

1 function call instead of 3 calls set 5 key value pairs instead of 6 (duplicate key)

Official change was to coalesce at the higher PG layer

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

Start with CRC locks

Bufferlist tuning

Optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability 4 to 224 nodes scales up to gt50 Petabyte

Immortal System with Zero downtime Non-disruptive add remove exchange of hardware (disks and nodes) or Software update

TCO optimized Highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important

As Cephrsquos popularity increases competitors will attack Ceph in these areas

39 39 Copyright 2015 FUJITSU

pvonstamwitzusfujitsucom

30

The safe way to make Ceph storage enterprise ready

ETERNUS CD10k integrated in OpenStack

mSHEC Erasure Code from Fujitsu

Contribution to performance enhancements

31

Areas to improve Ceph performance

Ceph has an adequate performance today

But there are performance issues which prevent us from taking full

advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large cluster scalability with many nodes ODS

32

LTTng general httplttngorg

General

open source tracing framework for Linux

trace Linux kernel and user space applications

low overhead and therefore usable on

production systems

activate tracing at runtime

Ceph code contains LTTng trace points already

Our LTTng based profiling

activate within a function collect timestamp information at the interesting places

save collected information in a single trace point at the end of the function

transaction profiling instead of function profiling use Ceph transaction ids to

correlate trace points

focused on primary and secondary write operations

33

Turn Around Time of a single Write IO

34

LTTng data evaluation Replication Write

Observation


31

Areas to improve Ceph performance

Ceph has adequate performance today

But there are performance issues that prevent us from taking full advantage of our hardware resources

Two main goals for improvement

(1) Decrease latency in the Ceph code path

(2) Enhance large-cluster scalability with many nodes / OSDs

32

LTTng general (http://lttng.org)

General

open source tracing framework for Linux

traces the Linux kernel and user-space applications

low overhead and therefore usable on production systems

tracing can be activated at runtime

Ceph code already contains LTTng trace points

Our LTTng-based profiling

within a function, collect timestamp information at the interesting places

save the collected information in a single trace point at the end of the function

transaction profiling instead of function profiling: use Ceph transaction ids to correlate trace points

focused on primary and secondary write operations
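The profiling pattern above — collect timestamps inside the function, then emit one trace point per transaction at the end — can be sketched as follows. This is a minimal illustration, not the actual Ceph/LTTng code; the struct and function names are invented, and the final `emit()` prints where a real implementation would fire a single LTTng `tracepoint(...)`.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// One profiling record per transaction: timestamps are collected inside the
// traced function and emitted as a single trace point at the end.
struct TxnProfile {
    uint64_t txn_id;  // Ceph transaction id, used to correlate trace points
    std::vector<std::pair<std::string, Clock::time_point>> marks;

    void mark(const std::string& stage) { marks.emplace_back(stage, Clock::now()); }

    // Stand-in for the single trace point at the end of the function;
    // a real implementation would call tracepoint(...) here instead.
    void emit() const {
        for (size_t i = 1; i < marks.size(); ++i) {
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          marks[i].second - marks[i - 1].second).count();
            std::cout << "txn " << txn_id << " " << marks[i - 1].first << "->"
                      << marks[i].first << " " << us << "us\n";
        }
    }
};

void handle_write(uint64_t txn_id) {
    TxnProfile p{txn_id, {}};
    p.mark("enter");
    // ... enqueue / dispatch work ...
    p.mark("journal");
    // ... journal write ...
    p.mark("commit");
    p.emit();  // one trace point per transaction, not one per stage
}
```

The advantage over per-stage trace points is that each record already carries the whole per-transaction timeline, so no post-hoc event matching is needed.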

33

Turn Around Time of a single Write IO

34

LTTng data evaluation: Replication Write

Observation

Replication write latency suffers from a large-variance problem

Minimum and average differ by a factor of 2

This is a common problem, visible for many ceph-osd components

Why is the variance so large?

Observation: no single hotspot visible

Observation: active processing steps do not differ between the minimum and the average sample as much as the total latency does

Additional latency penalty mostly at the switch from sub_op_modify_commit to Pipe::Writer

No indication that queue length is the cause

Question: can the overall thread load on the system and Linux scheduling be the reason for the delayed start of the Pipe::Writer thread?

35

Thread classes and ceph-osd CPU usage

Threads per ceph-osd depend on the complexity of the Ceph cluster: 3 nodes with 4 OSDs each → ~700 threads per node; 9 nodes with 40 OSDs each → >100k threads per node

ThreadPool::WorkThread is a hot spot = work in the ObjectStore / FileStore

Total CPU usage during the test: 4317 CPU seconds

Thread class                            CPU seconds   Share
Pipe::Writer                                    459   10.63 %
Pipe::Reader                                    581   13.45 %
ShardedThreadPool::WorkThreadSharded            808   18.70 %
ThreadPool::WorkThread                         1556   36.04 %
FileJournal::Writer                             241    5.57 %
FileJournal::WriteFinisher                      101    2.33 %
Finisher::finisher_thread_entry                 286    6.63 %

36

FileStore benchmarking

Most of the work is done in FileStore::do_transactions

Each write transaction consists of

3 calls to omap_setkeys

the actual call to write to the file system

2 calls to setattr

Proposal: coalesce the calls to omap_setkeys

1 function call instead of 3; set 5 key/value pairs instead of 6 (one duplicate key)

The official change was to coalesce at the higher PG layer
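The coalescing idea can be sketched as a simple merge of the per-call key/value batches, where a later batch overwrites a duplicate key — which is how 3 calls carrying 6 pairs collapse into 1 call carrying 5. This is an illustration of the technique only; the function name is invented and this is not the FileStore/PG API.

```cpp
#include <map>
#include <string>
#include <vector>

using KV = std::map<std::string, std::string>;

// Merge several omap_setkeys batches into a single batch.
// Later batches overwrite duplicate keys ("last write wins"), matching the
// semantics of issuing the calls one after another.
KV coalesce_setkeys(const std::vector<KV>& batches) {
    KV merged;
    for (const auto& batch : batches)
        for (const auto& kv : batch)
            merged[kv.first] = kv.second;
    return merged;
}
```

One merged call saves the per-call overhead (transaction encoding, key lookups in the omap backend) and drops the redundant write of the duplicated key.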

37

Other areas of investigation and improvement

Lock analysis

RWLock instead of mutex

starting with the CRC locks

Bufferlist tuning

optimize for jumbo packets

malloc issues

Copyright 2015 FUJITSU

38

Summary and Conclusion

ETERNUS CD10k is the safe way to make Ceph enterprise ready

Unlimited Scalability: 4 to 224 nodes, scales up to >50 petabytes

Immortal System with Zero Downtime: non-disruptive add/remove/exchange of hardware (disks and nodes) or software updates

TCO optimized: highly automated and fully integrated management reduces operational efforts

Tight integration in OpenStack with its own GUI

Fujitsu will continue to enhance ease-of-use and performance

This is important:

As Ceph's popularity increases, competitors will attack Ceph in these areas

39 Copyright 2015 FUJITSU

pvonstamwitz@us.fujitsu.com
