
Source: cdn.opensfs.org/wp-content/uploads/2014/10/7-DDN_LiXi_lustre_QoS.pdf


ddn.com©2013 DataDirect Networks. All Rights Reserved.

Lustre QoS

based on NRS policy of Token Bucket Filter

2014/10/14

Shuichi Ihara, Li Xi

DataDirect Networks


Why Lustre QoS?

► Lustre provides scalable throughput and IOPS, which can increase linearly with the number of OSTs/MDTs

► Storage systems in HPC centers are usually shared by multiple organizations and various applications

► The ability to guarantee sustained file system performance is essential

• User experience, e.g. intolerable delay of ‘ls’

• Workloads that have certain performance requirements

• Increasing application areas outside the mainstream HPC, e.g. large parallel application (“file-per-process”) use cases

► Need to provide a mechanism to “allocate” or “limit” performance


Quality of Service (QoS)

► Quality of Service (QoS) is a mechanism to ensure a “guaranteed” performance

► QoS was developed mostly in the network world, especially on TCP/IP networks, which pose specific QoS challenges

► QoS features are available in many network hardware and network management software products

► QoS is somewhat less common in the storage world, although some (expensive) enterprise storage products claim QoS or QoS-like features


Lustre QoS

► Lustre QoS is hard given the complex software stack of Lustre

• Caches affect performance significantly

• Both server and client side mechanisms are needed for full control

• Extra communication between server and client might be needed

• Performance of ordinary operations should not be affected

► We present the first steps towards functional Lustre QoS: NRS TBF policies

• Classify RPCs according to their NID/JOBID/OPCODE

• Reschedule RPCs based on their classifications

• Throttle RPC rate by controlling token rate
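
The classification step can be pictured as mapping each RPC to a key chosen by the active TBF variant. The following is a minimal Python sketch of that idea, not the Lustre implementation; the `Rpc` class, its field names, and the `classify` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rpc:
    """Hypothetical stand-in for an incoming RPC (not a Lustre structure)."""
    nid: str     # client network identifier, e.g. "192.168.1.1@o2ib"
    jobid: str   # job identifier, e.g. "iozone.500"
    opcode: str  # operation name, e.g. "ost_write"


def classify(rpc: Rpc, variant: str) -> str:
    """Return the classification key for the given TBF variant."""
    keys = {"nid": rpc.nid, "jobid": rpc.jobid, "opcode": rpc.opcode}
    if variant not in keys:
        raise ValueError(f"unknown TBF variant: {variant}")
    return keys[variant]
```

RPCs that share a key would then share one queue and one token rate.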


The Network Request Scheduler (NRS)

► A component of the PTLRPC service

► NRS works on the server side and allows handling of incoming RPCs before passing them to the OSS/MDS threads for the backend file system

► This framework, together with a few policy options, was merged into the Lustre mainstream and has been available since Lustre 2.4.0

► The NRS framework is very flexible, and it is fairly easy and straightforward to add new policies

► Policies can reschedule RPCs based on NID/UID/GID/JOBID, etc.


QoS Algorithm: TBF

► Many types of QoS algorithms have been developed over the past few decades

► The Token Bucket Filter (TBF) is a major algorithm used in general network systems

• It's simple and easy to implement

• Many Ethernet switches and routers use TBF to enable QoS features

• TBF can accommodate very small bursts of traffic, and is also suitable for long-term data transmission


The Token Bucket Filter (TBF)

[Figure: token bucket filter. Tokens are replenished into a bucket of depth B at token rate R(t). INPUT data consumes tokens (N input data per N tokens) and leaves as OUTPUT at transfer rate O(t).]

When input data arrives but no token is available, it waits until enough tokens are ready.

In some cases, a very small burst of traffic (up to B) can pass.

O(t) ≤ R(t) + B
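
The behavior in the figure can be sketched in a few lines of Python. This is a generic token bucket under simplifying assumptions, not code from Lustre: the rate is constant rather than a function R(t), time is passed in explicitly, and a request that finds no token is rejected instead of waiting. The names `TokenBucket`, `try_consume` and `admitted` are illustrative.

```python
class TokenBucket:
    """Token bucket with a constant refill rate and a maximum depth."""

    def __init__(self, rate: float, depth: float, now: float = 0.0):
        self.rate = rate      # tokens added per second (R)
        self.depth = depth    # maximum tokens the bucket can hold (B)
        self.tokens = depth   # start with a full bucket
        self.last = now       # time of the last refill

    def _refill(self, now: float) -> None:
        # Add tokens for the elapsed time, never exceeding the depth.
        elapsed = now - self.last
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last = now

    def try_consume(self, now: float, n: float = 1.0) -> bool:
        """Take n tokens if available; otherwise the caller has to wait."""
        self._refill(now)
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def admitted(rate: float, depth: float, arrivals) -> int:
    """Count how many unit-sized arrivals pass without waiting."""
    bucket = TokenBucket(rate, depth)
    return sum(bucket.try_consume(t) for t in arrivals)
```

With a rate of 1 token per second and a depth of 3, ten simultaneous arrivals admit only the 3 that fit in the bucket, which is the burst allowance B in the bound above.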


NRS TBF policy of Lustre

[Figure: NRS TBF policy. Incoming RPC → Enqueue based on ID → FIFO queues with token buckets (filled by tokens) → Dequeue based on deadlines → Handling RPC.]
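
The flow in the figure (enqueue based on ID, FIFO queues, dequeue based on deadlines) can be sketched as follows. This is a simplified Python model, not the Lustre code: it assumes one token per RPC, a bucket depth of one (no burst), and a deadline per class equal to the time its next token becomes available. The class name `TbfScheduler` and its methods are hypothetical.

```python
import heapq
from collections import deque


class TbfScheduler:
    """Per-class FIFO queues, served in order of each class's next deadline."""

    def __init__(self, rates, default_rate):
        self.rates = dict(rates)          # class key -> RPCs per second
        self.default_rate = default_rate  # rate for classes without a rule
        self.queues = {}                  # class key -> FIFO of pending RPCs
        self.deadline = {}                # class key -> earliest next dispatch
        self.heap = []                    # (deadline, key), classes with work

    def enqueue(self, key, rpc, now):
        queue = self.queues.setdefault(key, deque())
        queue.append(rpc)
        if len(queue) == 1:
            # Class just became active: place it on the heap once.
            d = max(self.deadline.get(key, now), now)
            self.deadline[key] = d
            heapq.heappush(self.heap, (d, key))

    def dequeue(self, now):
        """Return the next RPC whose class deadline has passed, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        d, key = heapq.heappop(self.heap)
        rpc = self.queues[key].popleft()
        rate = self.rates.get(key, self.default_rate)
        self.deadline[key] = d + 1.0 / rate  # next token for this class
        if self.queues[key]:
            heapq.heappush(self.heap, (self.deadline[key], key))
        return rpc
```

In this model a class with a rate of 2 is served every 0.5 seconds and a class with a rate of 1 every second, regardless of how many RPCs either has queued.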


TBF patches for Lustre

► LU-3494 libcfs: Add relocation function to libcfs heap (Merged!)

• Add a function to efficiently change the rank of a queue

► LU-3558 ptlrpc: Add the NRS TBF policy (Merged!)

• Add the main framework of the TBF policy along with NID/JOBID based policies

► LU-5580 ptlrpc: policy switch directly in tbf (Review)

• Fix the inability to switch between TBF policies directly

► LU-5620 ptlrpc: Add QoS for opcode in NRS-TBF (Review)

• Add a TBF policy based on the opcode of the RPC
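
The first patch concerns re-ranking an entry that is already in a binary heap. The Python sketch below shows the general technique (track each entry's index, then sift up or down after its key changes); it is a generic illustration and not the libcfs code, and `IndexedMinHeap` and `relocate` are names assumed here.

```python
class IndexedMinHeap:
    """Binary min-heap that can re-rank an existing entry in O(log n)."""

    def __init__(self):
        self._heap = []  # list of [key, name] entries
        self._pos = {}   # name -> index of its entry in self._heap

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        self._pos[h[i][1]] = i
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] >= self._heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child][0] < self._heap[smallest][0]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def push(self, name, key):
        self._heap.append([key, name])
        self._pos[name] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def relocate(self, name, new_key):
        """Change an entry's key and restore heap order in place."""
        i = self._pos[name]
        self._heap[i][0] = new_key
        self._sift_up(i)                  # moves up if the key got smaller
        self._sift_down(self._pos[name])  # moves down if the key got larger

    def peek(self):
        return self._heap[0][1]
```

Without such an operation, changing a rank would mean removing the entry and inserting it again.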


Use cases

► Variants of the TBF policy have been developed for various use cases

► TBF policy based on NID

• Manage performance distribution between clients

► TBF policy based on Job ID

• Manage performance distribution between jobs

• Could be integrated with job management systems

► TBF policy based on Opcode

• Manage performance distribution between different kinds of operations


Use Case #1

Lustre QoS based on NIDs (Clients)

[Figure: a Lustre server (MDS/OSS) dividing the total available bandwidth between Lustre clients. High priority (“performance”) clients get high bandwidth; low priority clients get limited (lower) bandwidth. Labels also show submitted jobs (UserA: Job-X, Job-Y).]


Use Case #2

Lustre QoS based on JOBID

[Figure: a Lustre server (MDS/OSS) sharing the maximum Lustre bandwidth among clients running submitted jobs from UserA and UserB (Job-X, Job-Y). UserA+JOB-X has higher priority; the labels distinguish high bandwidth from low bandwidth.]


Use Case #3

Lustre QoS based on Opcode

[Figure: a Lustre server (MDS/OSS) dividing the total available RPC rate between operations issued by users on the clients (Opcode A, Opcode B). The high priority opcode gets a high RPC rate; the other gets a limited (lower) RPC rate.]


How to use TBF policies

Change NRS policy to TBF with NID

# lctl set_param ost.OSS.ost_io.nrs_policies="<NRS policy> <TBF argument>"

# lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"

Set a rule with a classification and token rate

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start <TBF's rule name> {NID} <rate>"

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start rule_client1 {192.168.1.1@o2ib} 1"

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start rule_clients {192.168.1.[2-16]@o2ib} 10"

Change the token rate

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change <TBF's rule name> <new rate>"

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change rule_client1 100"

Stop a rule (delete)

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop <TBF's rule name>"

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop rule_client1"
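
When many rules have to be managed, the command strings above can be generated rather than typed. The Python sketch below only builds the strings in the syntax shown on this slide and does not run `lctl`; the helper names `start_rule`, `change_rule` and `stop_rule` are hypothetical and not part of any Lustre tool.

```python
# Parameter name exactly as it appears in the commands above.
PARAM = "ost.OSS.ost_io.nrs_tbf_rule"


def start_rule(name: str, match: str, rate: int) -> str:
    """Build the command that starts a rule for a NID (or NID range)."""
    return f'lctl set_param {PARAM}="start {name} {{{match}}} {rate}"'


def change_rule(name: str, rate: int) -> str:
    """Build the command that changes the token rate of an existing rule."""
    return f'lctl set_param {PARAM}="change {name} {rate}"'


def stop_rule(name: str) -> str:
    """Build the command that stops (deletes) a rule."""
    return f'lctl set_param {PARAM}="stop {name}"'


if __name__ == "__main__":
    print(start_rule("rule_client1", "192.168.1.1@o2ib", 1))
    print(change_rule("rule_client1", 100))
    print(stop_rule("rule_client1"))
```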


Test result #1:

Lustre QoS based on TBF NID

"dd" command to a single OST from multiple clients with various QoS rules

(Write, 1MB IO, max_rpcs_in_flight=32)


Test result #2:

Lustre QoS based on TBF JOBID

Start JOBstats and change NRS policy to TBF with JOBID

# lctl set_param jobid_var=procname_uid

# lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"

Set a rule with a classification and token rate

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start <TBF's rule name> {JOBID} <rate>"

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start iozone_user1 {iozone.500} 1"

Change the token rate (change X to 10, 50 and 100)

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change iozone_user1 X"

Stop a rule (delete)

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop iozone_user1"

[Figure: user1's (uid=500) iozone (1M, Write). Throughput in MB/sec (axis 0 to 700) over a horizontal axis from 0 to 170, with phases labeled start, token=1, token=10, token=50, token=100, stop, and default #token.]
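
A rough upper bound for each phase of this test can be estimated from the token rate. The Python sketch below assumes that each token admits one RPC and that each 1 MB write travels in one RPC; both are assumptions made for illustration, and the function name is hypothetical. It does not reproduce the measured values, which are only in the figure.

```python
def throughput_cap_mb_per_s(token_rate: float, rpc_size_mb: float = 1.0) -> float:
    """Upper bound on throughput if each token admits one RPC of rpc_size_mb."""
    return token_rate * rpc_size_mb


if __name__ == "__main__":
    # Token rates used in the test above.
    for rate in (1, 10, 50, 100):
        cap = throughput_cap_mb_per_s(rate)
        print(f"token rate {rate:>3}: at most {cap:.0f} MB/s per limited service")
```

The rule is set on one `ost_io` service, so under these assumptions the bound applies per server rather than to the whole file system.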


Test result #3:

Lustre QoS based on TBF OPCODE

Change NRS policy to TBF with OPCODE

# lctl set_param jobid_var=procname_uid

# lctl set_param ost.OSS.ost_io.nrs_policies="tbf opcode"

Set a rule with a classification and token rate

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start <TBF's rule name> {OPCODE} <rate>"

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start write_limit {ost_write} 50"

Change the token rate (change X to 10, 50 and 100)

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change write_limit X"

Stop a rule (delete)

# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop write_limit"

[Figure: Results of IOR benchmark. Read performance and write performance in MB/s (axis 0 to 20000) for six cases: Read&Write, TBF, limit write; Only Read, TBF, limit write; Read&Write, TBF, no rule; Only Read, TBF, no rule; Read&Write, FIFO; Only Read, FIFO.]


Summary

► We present a server-side QoS mechanism which combines the Lustre Network Request Scheduler (NRS) framework with the traditional Token Bucket Filter (TBF) algorithm

► Variants of this policy have been developed for different requirements and purposes, and each has proved useful in its own use case

► Further work

• Multi-layered NRS framework

• Client side QoS


Thank you!
