
Page 1:

Real-Time Throughput
Making Real-Time Linux Enterprise Ready

Gregory Haskins
Senior Software Architect - Linux Solutions Group
[email protected]

Portions copyright (C) 2008 Sven-Thorsten Dietrich
License: Creative Commons http://creativecommons.org/licenses/by-nc/2.0

Real-Time Throughput
Making Real-Time Linux Enterprise Ready

Steve Rostedt
Senior Software Engineer
[email protected] ([email protected])

Page 2:

Presentation Overview

• Introduction to Real-Time and PREEMPT_RT
– What it is
– Use cases
– How it works and performance characteristics

• Challenges for deploying Real-Time in the Enterprise
• Technical developments to make Real-Time Enterprise Ready

Page 3:

What is Real-Time?

Page 4:

What is Real-Time?

• Program execution time is known
– Real-Time is deterministic program execution
– Program execution time is bounded by a maximum
– RT response requirements are a property of the application

• Soft Real-Time– Value of the computation diminishes after deadline passes

> Pointer slow to change to hour glass after mouse click

• Hard Real-Time– Value of computation at most 0 after deadline passes

> Fission chamber fuel-rod retraction

Copyright (C) 2008 Sven-Thorsten Dietrich

Page 5:

The Car example

Page 6:

Real-Time Use Cases

• Entertainment Media
– AV playback
– Gaming
• AV editing/mixing
• Financial Services
– Algorithmic trading
– RegNMS compliance
• Embedded Control Systems
• High Performance Transaction / Database Servers
• Network QoS
• Telephony
• Event Based Systems
• Industrial Automation

Page 7:

What is Real-Time Linux?

• PREEMPT_RT patches modify the upstream linux-2.6 kernel to offer increased deterministic response

– First emerged around 2005, started by Ingo Molnar
– Led/maintained by Ingo, Steven Rostedt, and Thomas Gleixner
– Developed by a small group of FOSS engineers/companies from around the world

• Technology developed for PREEMPT_RT has trickled its way upstream, improving Linux for everyone.

– Lockdep, bug fixes, ftrace, scheduler improvements, etc.

• A full merge with upstream is intended over the next few releases (2.6.28 - 2.6.32?)

Page 8:

How does Real-Time Linux work?

• PREEMPT_RT eliminates most non-preemptible code:

– Most interrupt handlers are converted to threads
– Most kernel spinlocks are converted to mutexes which support priority queueing and PI
– Other locking constructs are re-worked as well

> Big Kernel Lock (BKL) and rw-locks converted to PI-mutexes
> Preemptible RCU enhancements
> User-Space Mutex – pthread compatible, with Robustness / Dead-Owner / Priority Queueing

• Task scheduler manages RT threads as a (virtual) global queue.

Page 9:

Kernel Evolution: Preemptible Code

[Diagram: proportion of preemptible vs. non-preemptible kernel code, growing from Kernel 2.0, through Kernels 2.2-2.4, the Preemptible Kernel 2.4 and Kernel 2.6, to the Real-Time Kernel 2.6]

Copyright (C) 2008 Sven-Thorsten Dietrich

Page 10:

Linux 2.6 – Latency Comparison

Copyright (C) 2008 Sven-Thorsten Dietrich

[Chart: latency comparison - No Preemption vs. Real-Time Preemption]

Page 11:

Learning more

• http://rt.wiki.kernel.org/

Page 12:

What Real-Time is Not

Page 13:

What Real-Time is not

The Costs of Determinism:
– Degraded throughput
– Increased minimum latencies

The Bottom Line:
– Real-Time is not Real Fast (it's predictable)
– Throughput will always come second to determinism

> We try to maximize both whenever possible

[Diagram: the trade-off between throughput and high responsiveness]

Portions Copyright (C) 2008 Sven-Thorsten Dietrich

Page 14:

Why does a throughput penalty exist?

• Part of the reason is simple physics
– We add more overhead to the system in order to add flexibility and responsiveness

> Threaded interrupts increase context switching rates
> Contending on a pi-mutex is more complex than a simple spinlock, etc.

• Part of the reason is maturity
– PREEMPT_RT is very young w.r.t. the mainline kernel

> Threaded interrupts obsolete the need for most deferment mechanisms (softirqs, tasklets, etc.), yet both currently remain
> CFS has trouble dealing with CPU-intensive/frequent RT tasks
> etc.

– Historically, RT is for embedded control systems with low emphasis on IO performance

Page 15:

Throughput - The Bottom Line

• There will always be a degree of unavoidable compromise.

• HOWEVER, much of the observed throughput degradation has been (or will be) fixed in the near term through enhancements.

– In the past year alone, we have seen PREEMPT_RT improve its IO capabilities by over 700%

– In many cases, it has achieved IO parity with its non-RT mainline counterpart

– More breakthroughs are coming!

• The purpose of this talk is to discuss these enhancements

Page 16:

Making Real-Time Linux Enterprise Ready

Page 17:

Real-Time in the Enterprise

• PREEMPT_RT is already very good at Real-Time
– Modern hardware + PREEMPT_RT can easily achieve average latencies under 10us, with maximums somewhere between 30us and 150us, nearly “out of the box”.

• However, when contrasted with the embedded space, enterprise environments typically:
– Place additional emphasis on IO performance
– Have larger core counts

> 1-2 cores typical in embedded
> 4, 8, 16, 32++ cores typical in enterprise

• Goal: Retain the excellent RT characteristics while addressing IO throughput and CPU scalability

Page 18:

Real-Time Task Balancing

• Worked with Steven to redesign the RT scheduler from “spray-n-pray” to a distinct push/pull design
– SCHED_OTHER uses per-cpu runqueues with periodic (passive) balancing; RT treats the system as a virtual global run-queue and rebalances tasks actively according to priority
– Tasks are “pushed” out to preempt the lowest-priority run-queue during wakeup
– Overloaded tasks are “pulled” from other runqueues whenever the given runqueue's priority drops

• Keeps the highest priority tasks on the cpus at all times
– Corrects bugs w.r.t. global misprioritization
– Eliminates wasted IPIs and lock contention
– Reduces latency

Page 19:

Two Dimensional Priority Searching

• We must track the priority of every CPU in the system in order to efficiently move RT tasks around a virtually global run-queue.

• We must also be able to quickly perform a minimum search against the current state of the system.
– We introduce a 2-dimensional bitmap structure to maintain state for all CPUs.
– The first bitmap represents priority, from lowest to highest.
– Each entry in the first bitmap points to another bitmap: a cpumask_t of member CPUs.

• Using an O(1) bitfield search (typically in hardware) yields a bitmap containing all the lowest priority CPUs.
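The two-level search can be sketched in plain C. This is an illustrative toy (fixed at 64 priority levels and 64 CPUs, with hypothetical names), not the kernel's actual implementation:

```c
#include <stdint.h>

#define NR_PRIOS 64  /* illustrative; the kernel tracks ~102 levels */

struct prio_map {
    uint64_t prio_present;            /* bit p: some CPU is at prio p */
    uint64_t cpus_at_prio[NR_PRIOS];  /* per-priority CPU membership  */
    int      cur_prio[64];            /* current priority of each CPU */
};

/* Move a CPU from its old priority level to a new one.
 * Each CPU must be registered once via this call before searching. */
static void prio_map_set(struct prio_map *m, int cpu, int prio)
{
    int old = m->cur_prio[cpu];

    m->cpus_at_prio[old] &= ~(1ULL << cpu);
    if (!m->cpus_at_prio[old])
        m->prio_present &= ~(1ULL << old);

    m->cpus_at_prio[prio] |= 1ULL << cpu;
    m->prio_present       |= 1ULL << prio;
    m->cur_prio[cpu]       = prio;
}

/* O(1) search: the lowest set bit of prio_present names the lowest
 * priority level in use; its mask lists the candidate CPUs to push to. */
static uint64_t prio_map_lowest_cpus(const struct prio_map *m)
{
    if (!m->prio_present)
        return 0;
    return m->cpus_at_prio[__builtin_ctzll(m->prio_present)];
}
```

The find-first-set step maps directly to a hardware bit-scan instruction on most architectures, which is what makes the overall lookup effectively constant-time.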

Page 20:

Root Domains
Large system partitioning

• Managing a virtually global run-queue inherently adds global data.
– “Overload” state needs to be advertised so remote CPUs know when it's time to try to pull tasks over
– The current priority state (i.e. the 2-d bitmap) needs to be shared

• A “root-domain” is short for a root scheduling-domain.
– There's no reason to share data across disjoint cpu-sets
– We place virtually-global data into a root-domain to isolate scheduler interference across exclusive sets of cores

• Allows larger SMP nodes to truly partition and isolate unrelated interference.

Page 21:

Optimize non-migratory tasks

• Overhead can be reduced if a task is detected to be restricted to only one core (e.g. via cpus_allowed).
– However, interacting with cpus_allowed directly in the fast path can be relatively expensive.

• A key observation is that cpus_allowed is updated infrequently.
– Therefore, compute the hamming weight of cpus_allowed at update time
– Cache it in the task_struct as a speed/memory tradeoff

• weight == 1 means the task can skip all migration-related checks
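A minimal sketch of the caching idea, with hypothetical names standing in for the real task_struct fields:

```c
#include <stdint.h>

/* Toy task descriptor: the affinity mask is one 64-bit word here for
 * simplicity; the kernel's cpus_allowed is a full cpumask. */
struct task {
    uint64_t cpus_allowed;    /* affinity bitmask                     */
    int      nr_cpus_allowed; /* cached hamming weight of the mask    */
};

/* Slow path, taken rarely: recompute the weight when affinity changes. */
static void task_set_affinity(struct task *t, uint64_t mask)
{
    t->cpus_allowed    = mask;
    t->nr_cpus_allowed = __builtin_popcountll(mask);
}

/* Fast path, taken on every balancing decision: a single cached compare
 * replaces scanning the whole mask. weight == 1 means non-migratable. */
static int task_is_migratable(const struct task *t)
{
    return t->nr_cpus_allowed > 1;
}
```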

Page 22:

Cache Topology Awareness

• Modern multi-core CPUs have an inherent hierarchy w.r.t. the cache topology.

– E.g. cpus within a socket or cores on a die may share cache resources

• Linux already knows how to nominally interpret and represent these relationships (called sched-domains).

• We reference this topology information to make the best routing decisions for RT tasks.

– Tasks will migrate as close to their respective cache-hot data as possible.

Page 23:

Topology Awareness (cont)
Buddy-wake

• Tasks that have a tight coupled relationship (e.g. producer/consumer) will tend to run best when topologically “close” to one another in the hierarchy.

• 2.6.25 introduced a “buddy-wake” algorithm for CFS tasks by overloading wake_sync() operations to indicate a possible coupling.

• Extending this algorithm to support RT tasks as well boosts IO performance for RT tasks by over 25%.

Page 24:

Topology Awareness (cont)
Bi-directional Affinity

• The kernel already has the notion of cpu affinity.
– A task will tend to run on the same CPU across different scheduling horizons

• It will also attempt to establish a relationship between a waker and wakee
– Move the wakee to the same cpu as the waker (i.e. buddy)

• However, the “waker” cpu may not always be ideal
– Perhaps the waker is fairly loaded, but a topological neighbor of the waker is idle
– Performance is gained by moving the wakee closer to the waker, but not all the way to the waker itself.

Page 25:

Priority Inheritance

• Created a generic PI library
– Existing PI logic is entwined in the rtmutex library
– Break the logic out into a separate libpi library for general use

• New uses for the PI API
– Pi-workqueues (upstream)
> Prioritizable work-items can boost the kthread

– Pi-waitqueues (in development)
> Tasks blocked on a resource will update their queue position with priority changes. Highest wakes first.

– Futexes?
> Futexes currently use the rtmutex-proxy infrastructure, but could be adapted
> Possible applications for the futex requeue_pi problem

Page 26:

Lateral Lock Steal

• Priority-queued pi-mutexes grant the lock to the next waiter on release as PENDING
– If a high-priority thread contends on the lock in the PENDING state, it can preempt the next-waiter in favor of itself.

• A new algorithm extends this “steal” to tasks of equal priority
– The observation is that it is more efficient to allow the already-running thread to take the lock than to sleep a perfectly runnable task of equal priority

• Currently only supports SCHED_OTHER tasks, to avoid an unbounded latency
– Can be extended to RT tasks with some additional work

Page 27:

Adaptive Locks

• Non-RT uses normal spinlocks throughout the kernel, whereas RT converts most of these to a pi-mutex.
– A non-contended pi-mutex is roughly equivalent to a normal spinlock
– Contention enters the slow path and ultimately sleeps

• The observation is that most original uses of spinlocks are “short-hold” locks.
– Therefore, even contended locks will tend to release shortly after the contention is detected
– Sleeping the waiter is generally wasted thrashing

• Solution: develop a preemption-friendly adaptive spin/sleep algorithm for pi-mutexes.
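The spin-vs-sleep decision can be sketched as follows. This is a toy single-word model with hypothetical names; the real rtmutex code inspects the owner's runqueue state under the proper locking:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical owner descriptor for the sketch. */
struct owner_state {
    atomic_bool on_cpu;    /* is the lock owner currently executing? */
    atomic_bool lock_held;
};

/* Spin only while the owner is actively running, since a short-hold
 * lock will likely be released momentarily. Returns true if the caller
 * should fall back to the sleeping slow path, false if the lock was
 * released while spinning and a fast retry is worthwhile. */
static bool adaptive_wait(struct owner_state *o)
{
    while (atomic_load(&o->lock_held)) {
        if (!atomic_load(&o->on_cpu))
            return true;   /* owner preempted/sleeping: stop spinning */
        /* cpu_relax() would go here in kernel code */
    }
    return false;          /* lock released during the spin */
}
```

The key property is that the spin is bounded by the owner's time on CPU: the moment the owner is descheduled, the waiter stops burning cycles and sleeps.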

Page 28:

PI Optimization of RT-mutex

• Observation: Most lock acquisitions occur either via the fast path or via adaptive spin
– However, all slow-path acquisitions (including adaptive) pi-boost the owner regardless of whether it was needed
– Pi-boosting adds considerable overhead

• Therefore, we can optimize the slow path to only pi-boost when necessary. We must boost the owner if:
– the waiter must adaptively sleep, or
– the owner's priority is less than the waiter's priority

• Otherwise, we can skip the overhead and simply monitor our priority environment for changes
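The resulting boost rule reduces to a short predicate. A toy model (note: a higher number means a higher priority here, the reverse of the kernel's internal convention):

```c
#include <stdbool.h>

/* Boost the owner only when skipping it could cause a priority
 * inversion: either the waiter is about to sleep (so it can no longer
 * watch the owner), or the owner would otherwise run below the
 * waiter's priority. */
static bool need_pi_boost(int owner_prio, int waiter_prio,
                          bool waiter_will_sleep)
{
    return waiter_will_sleep || owner_prio < waiter_prio;
}
```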

Page 29:

Priority Inheritance Read/Write Locks

• Priority Inheritance is complex
• The current approach is one owner : one lock
• RW locks (and RW semaphores) have multiple owners (only one writer, but many readers)

• Limit the number of RW-lock readers (default: 1 per CPU)
– /proc/sys/kernel/rwlock_reader_limit
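A toy admission check illustrating the bounded-reader idea (names are hypothetical; the real implementation lives in the RT rwlock code and is tuned via the proc knob above):

```c
/* Bounding concurrent readers keeps the owner set small and
 * enumerable, so PI has a fixed set of tasks it may need to boost
 * when a writer arrives. */
struct rw_pi_lock {
    int readers;
    int reader_limit;  /* e.g. one reader per online CPU */
    int writer;        /* 0 or 1 */
};

/* Admit a reader only while the bounded owner set has room. */
static int read_trylock(struct rw_pi_lock *l)
{
    if (l->writer || l->readers >= l->reader_limit)
        return 0;  /* would exceed the bounded owner set: must wait */
    l->readers++;
    return 1;
}

static void read_unlock(struct rw_pi_lock *l)
{
    l->readers--;
}
```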

Page 30:

Questions?

Page 31:

Processor Quantum (PQ)

• CFS “load” is irrelevant for RT tasks
– RT tasks/IRQs can run as frequently and as long as they wish w.r.t. CFS tasks
– Load mis-accounting causes problems when intermixing RT tasks with CFS tasks
– Intermixing tasks happens very frequently in PREEMPT_RT, but can happen in mainline as well

• PQ tracks the amount of “computational timeslices” available for tasks at or below a given priority

• Load can be scaled up or down depending on the PQ rating to properly account for RT tasks

Page 32:

Networking Enhancements

• Observation: network IO causes lots of context switching
– Solution: Collapse NAPI bottom-half processing into the (threaded) IRQ

> Convert the NET-RX softirq to a workqueue to take advantage of existing infrastructure
> Deferred work-items allow an easy “software interrupt coalescing” feature to drop in
» Can schedule the work-item in the future, independent of hardware support for coalescing

– Support ingress prioritization through the entire stack
> Extend NAPI to allow queue prioritization end-to-end with the application
> Extend the workqueue API to accept a prio parameter instead of inheriting from the calling task