
Page 1:

Real-Time Throughput
Making Real-Time Linux Enterprise Ready

Gregory Haskins
Senior Software Architect - Linux Solutions Group
[email protected]

Portions copyright (C) 2008 Sven-Thorsten Dietrich
License: Creative Commons http://creativecommons.org/licenses/by-nc/2.0

Real-Time Throughput
Making Real-Time Linux Enterprise Ready

Steve Rostedt
Senior Software Engineer
[email protected] ([email protected])

Page 2:

Presentation Overview

• Introduction to Real-Time and PREEMPT_RT
– What it is
– Use cases
– How it works and performance characteristics

• Challenges for deploying Real-Time in the Enterprise
• Technical developments to make Real-Time Enterprise Ready

Page 3:

What is Real-Time?

Page 4:

What is Real-Time?

• Program execution time is known
– Real-Time is deterministic program execution
– Program execution time is bounded by a maximum
– RT response requirements are a property of the application

• Soft Real-Time– Value of the computation diminishes after deadline passes

> Pointer slow to change to hour glass after mouse click

• Hard Real-Time– Value of computation at most 0 after deadline passes

> Fission chamber fuel-rod retraction

Copyright (C) 2008 Sven-Thorsten Dietrich

Page 5:

The Car example

Page 6:

Real-Time Use Cases

• Entertainment Media
– AV playback
– Gaming
• AV editing/mixing
• Financial Services
– Algorithmic trading
– RegNMS compliance
• Embedded Control Systems
• High Performance Transaction / Database Servers
• Network QoS
• Telephony
• Event Based Systems
• Industrial Automation

Page 7:

What is Real-Time Linux?

• PREEMPT_RT patches modify the upstream linux-2.6 kernel to offer increased deterministic response

– First emerged around 2005, started by Ingo Molnar
– Led/maintained by Ingo, Steven Rostedt, and Thomas Gleixner
– Developed by a small group of FOSS engineers/companies from around the world

• Technology developed for PREEMPT_RT has trickled its way upstream, improving Linux for everyone.

– Lockdep, bug fixes, ftrace, scheduler improvements, etc.

• A full merge with upstream is intended over the next few releases (2.6.28 - 2.6.32?)

Page 8:

How does Real-Time Linux work?

• PREEMPT_RT eliminates most non-preemptible code:

– Most interrupt handlers are converted to threads
– Most kernel spinlocks are converted to mutexes which support priority queueing and PI
– Other locking constructs are re-worked as well

> Big Kernel Lock (BKL) and rw-locks converted to PI-mutexes
> Preemptible RCU enhancements
> User-Space Mutex – pthread compatible, with Robustness / Dead-Owner / Priority Queueing

• Task scheduler manages RT threads as a (virtual) global queue.

Page 9:

Kernel Evolution: Preemptible Code

[Diagram: proportion of preemptible vs. non-preemptible kernel code, growing from Kernel 2.0, through Kernels 2.2-2.4, the Preemptible Kernel 2.4 and Kernel 2.6, to the Real-Time Kernel 2.6]

Copyright (C) 2008 Sven-Thorsten Dietrich

Page 10:

Linux 2.6 – Latency Comparison

Copyright (C) 2008 Sven-Thorsten Dietrich

[Chart: latency comparison - No Preemption vs. Real-Time Preemption]

Page 11:

Learning more

• http://rt.wiki.kernel.org/

Page 12:

What Real-Time is Not

Page 13:

What Real-Time is not

The Costs of Determinism:
– Degraded throughput
– Increased minimum latencies

The Bottom Line:
– Real-Time is not Real Fast (it's predictable)
– Throughput will always come second to determinism

> We try to maximize both whenever possible

[Diagram: the trade-off between throughput and high responsiveness]

Portions Copyright (C) 2008 Sven-Thorsten Dietrich

Page 14:

Why does a throughput penalty exist?

• Part of the reason is simple physics
– We add more overhead to the system in order to add flexibility and responsiveness

> Threaded interrupts increase context switching rates
> Contending on a pi-mutex is more complex than a simple spinlock, etc.

• Part of the reason is maturity
– PREEMPT_RT is very young w.r.t. the mainline kernel

> Threaded interrupts obsolete the need for most deferment mechanisms (softirqs, tasklets, etc.), yet both currently remain
> CFS has trouble dealing with CPU-intensive/frequent RT tasks
> etc.

– Historically, RT is for embedded control systems with low emphasis on IO performance

Page 15:

Throughput - The Bottom Line

• There will always be a degree of unavoidable compromise.

• HOWEVER, much of the observed throughput degradation has been (or will be) fixed in the near term through enhancements.

– In the past year alone, we have seen PREEMPT_RT improve its IO capabilities by over 700%

– In many cases, it has achieved IO parity with its non-RT mainline counterpart

– More breakthroughs are coming!

• The purpose of this talk is to discuss these enhancements

Page 16:

Making Real-Time Linux Enterprise Ready

Page 17:

Real-Time in the Enterprise

• PREEMPT_RT is already very good at Real-Time
– Modern hardware + PREEMPT_RT can easily achieve average latencies under 10us, with maximums somewhere between 30us and 150us, nearly “out of the box”.

• However, when contrasted with the embedded space, enterprise environments typically:
– Place additional emphasis on IO performance
– Have larger core counts

> 1-2 cores typical in embedded
> 4, 8, 16, 32++ cores typical in enterprise

• Goal: Retain the excellent RT characteristics while addressing IO throughput and CPU scalability

Page 18:

Real-Time Task Balancing

• Worked with Steven to redesign the RT scheduler from “spray-n-pray” to a distinct push/pull design
– SCHED_OTHER uses per-cpu runqueues with periodic (passive) balancing; RT treats the system as a virtual global run-queue and rebalances tasks actively according to priority
– Tasks are “pushed” out to preempt the lowest-priority run-queue during wakeup
– Overloaded tasks are “pulled” from other runqueues whenever the given runqueue's priority drops

• Keeps the highest priority tasks on the cpus at all times
– Corrects bugs w.r.t. global misprioritization
– Eliminates wasted IPIs and lock contention
– Reduces latency

Page 19:

Two Dimensional Priority Searching

• We must track the priority of every CPU in the system in order to efficiently move RT tasks around a virtually global run-queue.

• We must also be able to quickly perform a minimum search against the current state of the system.
– We introduce a 2-dimensional bitmap structure to maintain state for all CPUs.
– The first bitmap represents priority, from lowest to highest.
– Each entry in the first bitmap points to another bitmap: a cpumask_t of member CPUs.

• Using an O(1) bitfield search (typically in hardware) yields a bitmap containing all the lowest priority CPUs.
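The two-level search can be sketched in plain C. This is an illustrative toy (fixed at 64 priority levels and 64 CPUs, with hypothetical names), not the kernel's actual implementation:

```c
#include <stdint.h>

#define NR_PRIOS 64  /* illustrative; the kernel tracks ~102 levels */

struct prio_map {
    uint64_t prio_present;            /* bit p: some CPU is at prio p */
    uint64_t cpus_at_prio[NR_PRIOS];  /* per-priority CPU membership  */
    int      cur_prio[64];            /* current priority of each CPU */
};

/* Move a CPU from its old priority level to a new one.
 * Each CPU must be registered once via this call before searching. */
static void prio_map_set(struct prio_map *m, int cpu, int prio)
{
    int old = m->cur_prio[cpu];

    m->cpus_at_prio[old] &= ~(1ULL << cpu);
    if (!m->cpus_at_prio[old])
        m->prio_present &= ~(1ULL << old);

    m->cpus_at_prio[prio] |= 1ULL << cpu;
    m->prio_present       |= 1ULL << prio;
    m->cur_prio[cpu]       = prio;
}

/* O(1) search: the lowest set bit of prio_present names the lowest
 * priority level in use; its mask lists the candidate CPUs to push to. */
static uint64_t prio_map_lowest_cpus(const struct prio_map *m)
{
    if (!m->prio_present)
        return 0;
    return m->cpus_at_prio[__builtin_ctzll(m->prio_present)];
}
```

The find-first-set step maps directly to a hardware bit-scan instruction on most architectures, which is what makes the overall lookup effectively constant-time.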

Page 20:

Root Domains
Large system partitioning

• Managing a virtually global run-queue inherently adds global data.
– “Overload” state needs to be advertised so remote CPUs know when it's time to try to pull tasks over
– The current priority state (i.e. the 2-d bitmap) needs to be shared

• A “root-domain” is short for a root scheduling-domain.
– There's no reason to share data across disjoint cpu-sets
– We place virtually-global data into a root-domain to isolate scheduler interference across exclusive sets of cores

• Allows larger SMP nodes to truly partition and isolate unrelated interference.

Page 21:

Optimize non-migratory tasks

• Overhead can be reduced if a task is detected to be restricted to only one core (e.g. via cpus_allowed).
– However, interacting with cpus_allowed directly in the fast path can be relatively expensive.

• A key observation is that cpus_allowed is updated infrequently.
– Therefore, compute the hamming weight of cpus_allowed at update time
– Cache it in the task_struct as a speed/memory tradeoff

• weight == 1 means the task can skip all migration-related checks
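A minimal sketch of the caching idea, with hypothetical names standing in for the real task_struct fields:

```c
#include <stdint.h>

/* Toy task descriptor: the affinity mask is one 64-bit word here for
 * simplicity; the kernel's cpus_allowed is a full cpumask. */
struct task {
    uint64_t cpus_allowed;    /* affinity bitmask                     */
    int      nr_cpus_allowed; /* cached hamming weight of the mask    */
};

/* Slow path, taken rarely: recompute the weight when affinity changes. */
static void task_set_affinity(struct task *t, uint64_t mask)
{
    t->cpus_allowed    = mask;
    t->nr_cpus_allowed = __builtin_popcountll(mask);
}

/* Fast path, taken on every balancing decision: a single cached compare
 * replaces scanning the whole mask. weight == 1 means non-migratable. */
static int task_is_migratable(const struct task *t)
{
    return t->nr_cpus_allowed > 1;
}
```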

Page 22:

Cache Topology Awareness

• Modern multi-core CPUs have an inherent hierarchy w.r.t. the cache topology.

– E.g. cpus within a socket or cores on a die may share cache resources

• Linux already knows how to nominally interpret and represent these relationships (called sched-domains).

• We reference this topology information to make the best routing decisions for RT tasks.

– Tasks will migrate as close to their respective cache-hot data as possible.

Page 23:

Topology Awareness (cont)
Buddy-wake

• Tasks that have a tight coupled relationship (e.g. producer/consumer) will tend to run best when topologically “close” to one another in the hierarchy.

• 2.6.25 introduced a “buddy-wake” algorithm for CFS tasks by overloading wake_sync() operations to indicate a possible coupling.

• Extending this algorithm to support RT tasks as well boosts IO performance for RT tasks by over 25%.

Page 24:

Topology Awareness (cont)
Bi-directional Affinity

• The kernel already has the notion of cpu affinity.
– A task will tend to run on the same CPU across different scheduling horizons

• It will also attempt to establish a relationship between a waker and wakee
– Move the wakee to the same cpu as the waker (i.e. buddy)

• However, the “waker” cpu may not always be ideal
– Perhaps the waker is fairly loaded, but a topological neighbor of the waker is idle
– Performance is gained by moving the wakee closer to the waker, but not all the way to the waker itself.

Page 25:

Priority Inheritance

• Created a generic PI library
– Existing PI logic is entwined in the rtmutex library
– Break the logic out into a separate libpi library for general use

• New uses for the PI API
– Pi-workqueues (upstream)
> Prioritizable work-items can boost the kthread

– Pi-waitqueues (in development)
> Tasks blocked on a resource will update their queue position with priority changes. Highest wakes first.

– Futexes?
> Futexes currently use the rtmutex-proxy infrastructure, but could be adapted
> Possible applications for the futex requeue_pi problem

Page 26:

Lateral Lock Steal

• Priority-queued pi-mutexes grant the lock to the next waiter on release as PENDING
– If a high-priority thread contends on the lock in the PENDING state, it can preempt the next-waiter in favor of itself.

• A new algorithm extends this “steal” to tasks of equal priority
– The observation is that it is more efficient to allow the already-running thread to take the lock than to sleep a perfectly runnable task of equal priority

• Currently only supports SCHED_OTHER tasks, to avoid an unbounded latency
– Can be extended to RT tasks with some additional work

Page 27:

Adaptive Locks

• Non-RT uses normal spinlocks throughout the kernel, whereas RT converts most of these to a pi-mutex.
– A non-contended pi-mutex is roughly equivalent to a normal spinlock
– Contention enters the slow path and ultimately sleeps

• The observation is that most original uses of spinlocks are “short-hold” locks.
– Therefore, even contended locks will tend to release shortly after the contention is detected
– Sleeping the waiter is generally wasted thrashing

• Solution: develop a preemption-friendly adaptive spin/sleep algorithm for pi-mutexes.
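The spin-vs-sleep decision can be sketched as follows. This is a toy single-word model with hypothetical names; the real rtmutex code inspects the owner's runqueue state under the proper locking:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical owner descriptor for the sketch. */
struct owner_state {
    atomic_bool on_cpu;    /* is the lock owner currently executing? */
    atomic_bool lock_held;
};

/* Spin only while the owner is actively running, since a short-hold
 * lock will likely be released momentarily. Returns true if the caller
 * should fall back to the sleeping slow path, false if the lock was
 * released while spinning and a fast retry is worthwhile. */
static bool adaptive_wait(struct owner_state *o)
{
    while (atomic_load(&o->lock_held)) {
        if (!atomic_load(&o->on_cpu))
            return true;   /* owner preempted/sleeping: stop spinning */
        /* cpu_relax() would go here in kernel code */
    }
    return false;          /* lock released during the spin */
}
```

The key property is that the spin is bounded by the owner's time on CPU: the moment the owner is descheduled, the waiter stops burning cycles and sleeps.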

Page 28:

PI Optimization of RT-mutex

• Observation: Most lock acquisitions occur either via the fast path or via adaptive spin
– However, all slow-path acquisitions (including adaptive) pi-boost the owner regardless of whether it was needed
– Pi-boosting adds considerable overhead

• Therefore, we can optimize the slow path to only pi-boost when necessary. We must boost the owner if:
– the waiter must adaptively sleep, or
– the owner's priority is less than the waiter's priority

• Otherwise, we can skip the overhead and simply monitor our priority environment for changes
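The resulting boost rule reduces to a short predicate. A toy model (note: a higher number means a higher priority here, the reverse of the kernel's internal convention):

```c
#include <stdbool.h>

/* Boost the owner only when skipping it could cause a priority
 * inversion: either the waiter is about to sleep (so it can no longer
 * watch the owner), or the owner would otherwise run below the
 * waiter's priority. */
static bool need_pi_boost(int owner_prio, int waiter_prio,
                          bool waiter_will_sleep)
{
    return waiter_will_sleep || owner_prio < waiter_prio;
}
```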

Page 29:

Priority Inheritance Read/Write Locks

• Priority Inheritance is complex
• The current approach is one owner : one lock
• RW locks (and RW semaphores) have multiple owners (only one writer, but many readers)

• Limit the number of RW-lock readers (default: 1 per CPU)
– /proc/sys/kernel/rwlock_reader_limit
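A toy admission check illustrating the bounded-reader idea (names are hypothetical; the real implementation lives in the RT rwlock code and is tuned via the proc knob above):

```c
/* Bounding concurrent readers keeps the owner set small and
 * enumerable, so PI has a fixed set of tasks it may need to boost
 * when a writer arrives. */
struct rw_pi_lock {
    int readers;
    int reader_limit;  /* e.g. one reader per online CPU */
    int writer;        /* 0 or 1 */
};

/* Admit a reader only while the bounded owner set has room. */
static int read_trylock(struct rw_pi_lock *l)
{
    if (l->writer || l->readers >= l->reader_limit)
        return 0;  /* would exceed the bounded owner set: must wait */
    l->readers++;
    return 1;
}

static void read_unlock(struct rw_pi_lock *l)
{
    l->readers--;
}
```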

Page 30:

Questions?

Page 31:

Processor Quantum (PQ)

• CFS “load” is irrelevant for RT tasks
– RT tasks/IRQs can run as frequently and as long as they wish w.r.t. CFS tasks
– Load mis-accounting causes problems when intermixing RT tasks with CFS tasks
– Intermixing tasks happens very frequently in PREEMPT_RT, but can happen in mainline as well

• PQ tracks the amount of “computational timeslices” available for tasks at or below a given priority

• Load can be scaled up or down depending on the PQ rating to properly account for RT tasks

Page 32:

Networking Enhancements

• Observation: network IO causes lots of context switching
– Solution: Collapse NAPI bottom-half processing into the (threaded) IRQ

> Convert the NET-RX softirq to a workqueue to take advantage of existing infrastructure
> Deferred work-items allow an easy “software interrupt coalescing” feature to drop in
» Can schedule the work-item in the future, independent of hardware support for coalescing

– Support ingress prioritization through the entire stack
> Extend NAPI to allow queue prioritization end-to-end with the application
> Extend the workqueue API to accept a prio parameter instead of inheriting from the calling task