Toward A Unified Block IO Controller
Shaohua Li <[email protected]>
Software Engineer, Facebook



Page 2: Shaohua Li

Agenda
• Overview
• Unified IO Controller and Challenges
• The Solution
• Benchmark Data
• TODO

Page 3: Shaohua Li

Overview

Page 4: Shaohua Li

IO Controller
• Share IO resources between tasks
• Maintain fairness with specific policy
• 2 policies
  • CFQ
  • Block-throttle

Page 5: Shaohua Li

CFQ IO Controller
• Based on the CFQ IO scheduler
• Proportion based
• Time slice/IOPS accounting
• Fair*
• Performance issues
  • Does not scale
  • Idles the disk for fairness

Page 6: Shaohua Li

Block-throttle
• Throttle cgroup to upper limit
• Bandwidth/IOPS based
• No proportional scheduling
• User sets upper limit

Page 7: Shaohua Li

Blk-mq Challenges
• Multiple queues
• Target devices have high queue depth
• Scalable design
• No IO scheduler

Page 8: Shaohua Li

Unified IO Controller and Challenges

Page 9: Shaohua Li

Unified IO Controller
• Has both proportional and upper limit policies
• Works with blk-mq
• Scalable
• Block-throttle is a good candidate
  • Works with blk-mq
  • Has a global lock, but not too bad
    • Potentially a per-cpu cache for better scalability
  • Must add a proportional policy

Page 10: Shaohua Li

Block-throttle Workflow

[Flowchart: New IO → charge IO → within limit? If yes, dispatch IO; if no, calculate sleep time → sleep.]

• limit = time slice * bps
• sleep time = IO size / bps
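
The workflow on this slide can be sketched in a few lines of Python. This is only an illustration of the charge / check / sleep steps, not the kernel's blk-throttle code: the class name `ThrottleGroup`, its fields, and the `new_slice` helper are hypothetical, and an IO that exceeds the budget is simply reported as a sleep time rather than being queued and retried.

```python
from dataclasses import dataclass


@dataclass
class ThrottleGroup:
    """Hypothetical model of one throttled cgroup (not the kernel structure)."""
    bps: float            # configured upper limit, bytes per second
    time_slice: float     # length of the accounting window, seconds
    charged: float = 0.0  # bytes charged so far in the current window

    def new_slice(self) -> None:
        """Start a new accounting window (assumed to happen every time_slice)."""
        self.charged = 0.0

    def submit(self, io_size: int) -> float:
        """Charge one IO; return 0.0 to dispatch now, else the sleep time."""
        limit = self.time_slice * self.bps  # limit = time slice * bps
        self.charged += io_size             # charge IO
        if self.charged <= limit:           # within limit?
            return 0.0                      # yes: dispatch IO
        return io_size / self.bps           # no: sleep time = IO size / bps
```

For example, with `bps=1_000_000` and `time_slice=0.1` the budget is 100000 bytes per slice, so a first 60000-byte IO is dispatched immediately and a second one in the same slice is asked to sleep for about 0.06 seconds.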

Page 11: Shaohua Li

A Magic Disk

[Figure: disk capability divided between CGROUP 1 and CGROUP 2, labeled Weight 60 and Weight 40.]

Page 12: Shaohua Li

The Cruel World
• Disk capability isn’t fixed
  • IO size
  • Read/write ratio
  • IO depth
  • Queue depth
  • Sequential/random
• Measure IO cost
  • IO size
  • IOPS
  • IO time
  • Combine them?

Page 13: Shaohua Li
Page 14: Shaohua Li

The Solution

Page 15: Shaohua Li

Suboptimal Solution
• Use bandwidth or IOPS
  • Capability is total bandwidth or IOPS
  • IO cost is IO size or 1 per request
  • User chooses
• Adaptive
  • No fixed total bandwidth or IOPS
  • Feedback system: try and push to steady state

Page 16: Shaohua Li

Disk Bandwidth
• Estimate bandwidth
  • Bandwidth = current bandwidth / disk utilization
• Disk could be underutilized even with 100% utilization
  • Throttled tasks can’t dispatch more IO
• Always slightly overestimate
  • Overestimate bandwidth
  • Workload gets bigger limit
  • Workload sends more IO
  • Estimate higher bandwidth
  • Reach steady state at max bandwidth
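
A small Python sketch of this feedback loop, under stated assumptions: the function names, the 5% `headroom` factor, and the toy workload model (the workload always uses its whole limit, and the disk reports 100% utilization) are illustrative choices, not values from the talk.

```python
def estimate_bandwidth(current_bw: float, utilization: float,
                       headroom: float = 0.05) -> float:
    """Bandwidth = current bandwidth / disk utilization, slightly overestimated.

    `headroom` is a hypothetical knob for the "always slightly overestimate" rule.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return current_bw / utilization * (1.0 + headroom)


def simulate(true_max: float, initial_estimate: float, rounds: int,
             headroom: float = 0.05):
    """Toy loop: the workload is limited to the estimate, the disk caps it at true_max."""
    estimate = initial_estimate
    observed = 0.0
    for _ in range(rounds):
        observed = min(estimate, true_max)  # throttled tasks cannot exceed the limit
        # Assume the disk reports 100% utilization even when it could do more.
        estimate = estimate_bandwidth(observed, 1.0, headroom)
    return observed, estimate
```

Starting from an estimate of 50 on a disk that can actually deliver 100, each round raises the limit a little, the workload sends more IO, and the observed bandwidth climbs until it settles at the disk's maximum.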

Page 17: Shaohua Li

Cgroup Throttle
• cgroup share = weight / total weight
• cgroup bandwidth limit = cgroup share * estimated disk bandwidth
• Use the existing block-throttle mechanism to throttle
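
As a worked example of the two formulas, the following Python sketch computes per-cgroup limits from weights. The function name `cgroup_limits` and the dictionary layout are hypothetical; only the arithmetic comes from the slide.

```python
from typing import Dict


def cgroup_limits(weights: Dict[str, int], estimated_bw: float) -> Dict[str, float]:
    """Return each cgroup's bandwidth limit from its weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    # share = weight / total weight; limit = share * estimated disk bandwidth
    return {name: weight / total_weight * estimated_bw
            for name, weight in weights.items()}
```

With weights 200 and 800 and an estimated disk bandwidth of 1000 (in any unit), the limits come out to roughly 200 and 800 in the same unit.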

Page 18: Shaohua Li

Inactive Cgroup - Example
• Disk bandwidth is 100M/s
• 2 cgroups with 50% share, each gets 50M/s
• Cgroup1 is idle, cgroup2 is 50M/s
• Estimated bandwidth is 50M/s
• Each gets 25M/s
• Cgroup1 is idle, cgroup2 is 25M/s
• Estimated bandwidth is 25M/s
• …
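
The downward spiral in this example can be reproduced with a short Python sketch. It assumes, as the slide does, that only the active cgroup contributes to the measured bandwidth and that the next estimate equals what was measured; the function name `collapse` is hypothetical.

```python
def collapse(disk_bw: float, share: float, rounds: int) -> list:
    """Estimated bandwidth per round when only one cgroup with `share` is active."""
    estimates = []
    estimate = disk_bw
    for _ in range(rounds):
        active_limit = share * estimate  # the active cgroup's limit this round
        estimate = active_limit          # the idle cgroup adds nothing to the measurement
        estimates.append(estimate)
    return estimates


print(collapse(100.0, 0.5, 3))  # [50.0, 25.0, 12.5]
```

Each round halves the estimate, which is why a fixed share cannot be used when some cgroups are inactive.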

Page 19: Shaohua Li

Inactive Cgroup - Solution
• Dynamically adjust cgroup share
• Feedback system
  • Gradually decrease share if underutilized
  • Recover to the original share when the acting limit is hit
  • Defer share decrease if a recovery happened recently
• Eventually cgroup acting limit = bandwidth
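
One possible shape for this feedback rule, sketched in Python. Everything specific here is an assumption made for illustration: the class name `AdaptiveShare`, the 0.9 decay `step`, the 3-round `defer_rounds`, and the `throttled` flag that the caller is assumed to supply when the cgroup hit its acting limit. The talk does not give these details.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveShare:
    """Hypothetical per-cgroup state for dynamic share adjustment."""
    original: float                                # configured share (weight / total weight)
    acting: float = field(init=False)              # share currently enforced
    cooldown: int = field(default=0, init=False)   # rounds before a decrease is allowed

    def __post_init__(self) -> None:
        self.acting = self.original

    def update(self, used_bw: float, est_bw: float, throttled: bool,
               step: float = 0.9, defer_rounds: int = 3) -> float:
        if throttled:
            # Acting limit was hit: recover to the original share.
            self.acting = self.original
            self.cooldown = defer_rounds
        elif self.cooldown > 0:
            # A recovery happened recently: defer the decrease.
            self.cooldown -= 1
        elif used_bw < self.acting * est_bw:
            # Underutilized: decrease gradually, but never below actual usage.
            self.acting = max(self.acting * step, used_bw / est_bw)
        return self.acting
```

For a cgroup with a 50% share that only uses 2 out of an estimated 100, repeated calls shrink the acting share until the acting limit equals the bandwidth actually used, which matches the last bullet above.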

Page 20: Shaohua Li

Benchmark Data

Page 21: Shaohua Li

Benchmark
• NVMe disk
• Fio job with iodepth 1, 4k IO and 8 threads
  • Randread ~330M/s
  • Randwrite ~1.4GB/s
• Emulate inactive cgroup with fio ‘--rate=2M/s’

Page 22: Shaohua Li

Benchmark Result
• Cgroup1 weight 200, cgroup2 weight 800; random write
  • Cgroup1: 322042KB/s, cgroup2: 1020.4MB/s
• Cgroup1 weight 200, cgroup2 weight 800 with rate limit 2M/s; random write
  • Cgroup1: 1367.3MB/s, cgroup2: 2047KB/s

Page 23: Shaohua Li

Benchmark Result
• Cgroup1 weight 200, cgroup2 weight 800; random read
  • Cgroup1: 296690KB/s, cgroup2: 308275KB/s
• Cgroup1 weight 200, cgroup2 weight 800 with rate limit 2M/s; random read
  • Cgroup1: 338551KB/s, cgroup2: 2040KB/s

Page 24: Shaohua Li

Benchmark Result
• Cgroup1 weight 200 random write, cgroup2 weight 800 random read
  • Cgroup1: 168443KB/s, cgroup2: 287954KB/s
• Cgroup1 weight 200 random read, cgroup2 weight 800 random write
  • Cgroup1: 79038KB/s, cgroup2: 1127.8MB/s

Page 25: Shaohua Li

TODO
• More tuning
• Separate weight for read and write
• Throttle write

Page 26: Shaohua Li

Thank You!

Page 27: Shaohua Li