Tweaking Linux for a Green Datacenter

Vaidyanathan Srinivasan <[email protected]>
Jenifer Hopper <[email protected]>

IBM Linux Technology Center
© 2009 IBM Corporation



Agenda

● Platform features and Linux exploitation
● Tuning scheduler and cpufreq
● Saving power in an idle system
● Saving power in an underutilized system
● NUMA constraints for power management
● Can virtualization save power?
● Power trending and power capping


Energy Saving features in hardware

● Dynamic frequency and voltage scaling
    ● Predefined set of frequency and corresponding voltage states
    ● Dual core and quad core CPUs may share power domains and clock distribution
● Sleep states at idle
    ● Sleep states with low latency
    ● Deep sleep states (more power savings, higher latencies)
    ● Choice of sleep states (wakeup latency vs power savings)
    ● Package level sleep states


Energy Saving features in Linux kernel

● OnDemand frequency scaling (cpufreq)
● Tickless kernel (NO_HZ)
● Deferrable timers (for kernel code)
● Scheduler domains, CPU affinity, cpusets
● Multi core power saving heuristics: sched_mc_power_savings
● Support for deep sleep states
● Latency based selection of various low power deep sleep states (cpuidle governor)
● Device power management infrastructure (USB, PCI, ...)
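The latency based selection of sleep states can be pictured with a small sketch. This is a simplified illustration, not the kernel's cpuidle governor: the state table, its latency and residency numbers, and the name `pick_idle_state` are assumptions made up for the example.

```python
from typing import NamedTuple


class IdleState(NamedTuple):
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum sleep time for the state to pay off


# Hypothetical state table, ordered from shallow to deep (numbers are illustrative).
STATES = [
    IdleState("C1", 1, 1),
    IdleState("C3", 20, 80),
    IdleState("C6", 200, 800),
]


def pick_idle_state(predicted_idle_us: int, latency_limit_us: int) -> IdleState:
    """Return the deepest state that fits the expected idle time and the latency limit."""
    chosen = STATES[0]  # the shallowest state is always allowed
    for state in STATES[1:]:
        if state.exit_latency_us > latency_limit_us:
            break  # waking up would take longer than the caller tolerates
        if state.target_residency_us > predicted_idle_us:
            break  # the CPU is not expected to sleep long enough to benefit
        chosen = state
    return chosen
```

With a long predicted idle period and a loose latency limit the deepest state is picked; tightening the latency limit (as a latency sensitive workload would) forces a shallower state.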


Idle CPU Power Management

[Figure: processes P0 to P4 placed on four CPUs (CPU 0 to CPU 3), before and after consolidation]

● Low system utilization: consolidate workloads
● Move processes, timers and interrupts
● Tickless kernel helps idle CPUs to sleep longer
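Workload consolidation can be pictured as a packing problem. The sketch below is a minimal illustration, assuming each process has a known CPU utilization (a fraction of one CPU) and using a first-fit-decreasing heuristic. It is not how the Linux scheduler is implemented; `consolidate` and the utilization figures are hypothetical.

```python
from typing import Dict, List


def consolidate(utilization: Dict[str, float], num_cpus: int,
                capacity: float = 1.0) -> List[List[str]]:
    """First-fit decreasing: place each process on the lowest-numbered CPU with room."""
    cpus: List[List[str]] = [[] for _ in range(num_cpus)]
    load = [0.0] * num_cpus
    # Place the heaviest processes first so the light ones fill the gaps.
    for proc, util in sorted(utilization.items(), key=lambda kv: -kv[1]):
        for cpu in range(num_cpus):
            if load[cpu] + util <= capacity + 1e-9:  # small tolerance for float sums
                cpus[cpu].append(proc)
                load[cpu] += util
                break
        else:
            raise ValueError(f"no CPU has room for {proc}")
    return cpus


# Hypothetical utilizations for five processes on a four-CPU system.
placement = consolidate({"P0": 0.4, "P1": 0.3, "P2": 0.1, "P3": 0.1, "P4": 0.1},
                        num_cpus=4)
idle_cpus = [cpu for cpu, procs in enumerate(placement) if not procs]
```

In this example all five processes fit on CPU 0, which leaves the other three CPUs free to enter sleep states.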


Power saving enhancements

[Figure: two bar charts, "Power savings at idle" and "Power savings at 50% load". Axes: feature vs normalised average power (scale 82 to 100). One bar per configuration:]

● No power saving
● Sleep state
● Dynamic Voltage and Frequency Scaling (DVFS) + Sleep
● DVFS + Sleep + Tickless
● DVFS + Sleep + Tickless + sched_mc=1
● DVFS + Sleep + Tickless + sched_mc=2

Across the stack: hardware, firmware and Linux kernel

Approximate power savings percentages obtained across different experiments and hardware platforms


OnDemand CPU Frequency switching
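As a rough picture of on-demand frequency switching, the sketch below raises the frequency to the maximum when the CPU is busy and otherwise settles on the lowest frequency that can carry the measured load. It is a simplified model, not the kernel's ondemand governor; the frequency table, the 80% threshold and the name `next_frequency` are assumptions for illustration.

```python
AVAILABLE_MHZ = [1600, 2530, 2670, 2800, 2930]  # illustrative P-state table
UP_THRESHOLD = 80  # percent busy; assumed value of the tunable


def next_frequency(current_mhz: int, busy_percent: float) -> int:
    """Pick the next frequency from the load measured over the last sampling period."""
    if busy_percent > UP_THRESHOLD:
        return AVAILABLE_MHZ[-1]  # jump straight to the highest frequency
    # Otherwise find the lowest frequency that keeps the CPU at or below the threshold.
    needed_mhz = current_mhz * busy_percent / UP_THRESHOLD
    for mhz in AVAILABLE_MHZ:
        if mhz >= needed_mhz:
            return mhz
    return AVAILABLE_MHZ[-1]
```

Jumping directly to the top frequency keeps response time short when load arrives, while the step down is gradual and driven by the measured load.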


CPU Task consolidation -- Kernbench


CPU Task consolidation -- SPECPower


CPU Task consolidation – Use sibling threads


Saving power in Idle system

● Optimize applications to reduce wake-ups at idle
● Increase low power sleep state residency

PowerTOP 1.11 (C) 2007, 2008 Intel Corporation

Cn                Avg residency
C0 (cpu running)        ( 0.8%)
polling           0.0ms ( 0.0%)
C1 mwait         13.3ms ( 1.1%)
C3 mwait         46.1ms (98.0%)

P-states (frequencies)
2.93 Ghz     0.0%
2.80 Ghz     0.0%
2.67 Ghz     0.0%
2.53 Ghz     0.0%
1.60 Ghz   100.0%

Wakeups-from-idle per second : 22.1   interval: 235.0s
no ACPI power usage estimate available

Top causes for wakeups:
  49.0% ( 67.4)   <interrupt> : extra timer interrupt
  20.7% ( 28.5)   <kernel IPI> : Rescheduling interrupts
   6.5% (  8.9)   java : sk_reset_timer (tcp_write_timer)
   5.8% (  8.0)   <kernel core> : cpucache_init (delayed_work_timer_fn)
   4.3% (  5.9)   java : sk_reset_timer (tcp_delack_timer)
   3.6% (  5.0)   java : futex_wait (hrtimer_wakeup)
   3.3% (  4.6)   <interrupt> : eth0
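To track wakeup sources over time, the "Top causes for wakeups" section can be turned into structured data. The sketch below assumes the text layout of the PowerTOP 1.x listing above, with one cause per line; the regular expression and the names `Wakeup` and `parse_wakeups` are illustrative, not part of PowerTOP.

```python
import re
from typing import List, NamedTuple


class Wakeup(NamedTuple):
    percent: float  # share of all wakeups
    count: float    # value shown in parentheses
    source: str     # process name or <interrupt>, <kernel IPI>, ...
    detail: str     # timer function or interrupt name


# Matches lines such as "  49.0% ( 67.4)   <interrupt> : extra timer interrupt".
LINE_RE = re.compile(r"^\s*(\d+\.\d+)%\s*\(\s*(\d+\.\d+)\)\s+(.+?)\s+:\s+(.+?)\s*$")


def parse_wakeups(text: str) -> List[Wakeup]:
    """Extract the wakeup causes from PowerTOP-style text, skipping other lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            percent, count, source, detail = match.groups()
            rows.append(Wakeup(float(percent), float(count), source, detail))
    return rows


SAMPLE = """\
Top causes for wakeups:
  49.0% ( 67.4)   <interrupt> : extra timer interrupt
  20.7% ( 28.5)   <kernel IPI> : Rescheduling interrupts
   6.5% (  8.9)   java : sk_reset_timer (tcp_write_timer)
"""

rows = parse_wakeups(SAMPLE)
```

Grouping the rows by `source` then shows which application or kernel subsystem is worth optimizing first.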


Optimised idle load balancer

One CPU among the idle CPUs runs the sched tick and watches for overload on the busy CPUs.

Choosing this idle load balancer from a semi-idle CPU package will allow other CPU packages to go completely idle

[Figure: eight cores (Core 0 to Core 7) grouped into CPU packages. Legend: busy CPU running task; idle CPUs in deep sleep; idle load balancer running load balance and sched tick; package deep sleep state (zzZ). Caption: move idle load balancer to semi-idle CPU package]
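The selection rule described on this slide can be sketched as follows. This is an illustration of the idea only, not the kernel code: the two-cores-per-package topology and the name `pick_idle_load_balancer` are assumptions.

```python
from typing import List, Optional, Set


def pick_idle_load_balancer(packages: List[List[int]], busy: Set[int]) -> Optional[int]:
    """Prefer an idle core in a package that already has a busy core."""
    fallback = None
    for cores in packages:
        idle = [core for core in cores if core not in busy]
        if not idle:
            continue  # fully busy package has no candidate
        if len(idle) < len(cores):
            return idle[0]  # semi-idle package: this package is awake anyway
        if fallback is None:
            fallback = idle[0]  # fully idle package: use only if nothing better exists
    return fallback


# Hypothetical topology: four packages with two cores each.
PACKAGES = [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With a task running on core 4, the balancer role goes to core 5, so the packages holding cores 0 to 3 and 6 to 7 are never woken by the sched tick and can stay in a package level deep sleep state.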


Timer migration

● Tasks can be consolidated using the sched_mc framework
● Interrupts can be consolidated using the user space irqbalance daemon
● Migrating timers from idle CPUs to the idle-load-balancer CPU, coupled with optimized selection of the idle-load-balancer CPU, provides good results for package evacuation and increased deep sleep residency time
● Consolidating timers on a single CPU enables good overlap of timer expiry times and reduces the total number of wakeups from idle
● Range timer framework (from Arjan) will further help reduce wakeups from idle deep sleep state
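The effect of timer consolidation on wakeups can be shown with a toy model. The sketch assumes a CPU wakes once per distinct expiry time, and that with range timers any timers expiring within `slack_ms` of the earliest pending one can share a wakeup. The expiry times and the name `count_wakeups` are invented for the example.

```python
from typing import Dict, List


def count_wakeups(expiries_ms: List[int], slack_ms: int = 0) -> int:
    """Count wakeups on one CPU; timers within slack_ms of the first pending one share a wakeup."""
    wakeups = 0
    window_end = None
    for expiry in sorted(expiries_ms):
        if window_end is None or expiry > window_end:
            wakeups += 1                    # a new wakeup is needed
            window_end = expiry + slack_ms  # later timers up to here ride along
    return wakeups


# Hypothetical timer expiry times (ms) queued on three otherwise idle CPUs.
per_cpu: Dict[int, List[int]] = {0: [10, 50], 1: [10, 52], 2: [11, 50]}

spread = sum(count_wakeups(timers) for timers in per_cpu.values())
merged = [expiry for timers in per_cpu.values() for expiry in timers]
consolidated = count_wakeups(merged)
with_range = count_wakeups(merged, slack_ms=2)
```

In this example the three CPUs take 6 wakeups in total when each handles its own timers, 4 wakeups when the timers are migrated to one CPU, and 2 wakeups when a 2 ms range is allowed on top of that.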


Saving power in underutilized system

● Understanding workloads:
    ● Number of software threads and their CPU utilization
    ● Relation between threads: amount of data sharing
    ● Latency sensitive vs throughput oriented workloads
● Knobs available:
    ● CPU frequency governors and policies
    ● Scheduler tunables: sched_mc_power_savings and sched_smt_power_savings
    ● CPU idle governor and PM QoS framework
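A quick way to see how these knobs are currently set is to read them from sysfs. The sketch below assumes the sysfs paths used by kernels of the 2.6.29 era; later kernels dropped the sched_mc_power_savings and sched_smt_power_savings files, so missing files are reported rather than treated as errors. The `KNOBS` table and `read_knobs` are illustrative names.

```python
from pathlib import Path
from typing import Dict, Optional

# Paths relative to the sysfs mount point (assumed layout; may differ per kernel).
KNOBS = {
    "cpufreq governor (cpu0)": "devices/system/cpu/cpu0/cpufreq/scaling_governor",
    "sched_mc_power_savings": "devices/system/cpu/sched_mc_power_savings",
    "sched_smt_power_savings": "devices/system/cpu/sched_smt_power_savings",
    "cpuidle governor": "devices/system/cpu/cpuidle/current_governor_ro",
}


def read_knobs(sysfs_root: str = "/sys") -> Dict[str, Optional[str]]:
    """Return the current value of each knob, or None when the file is not readable."""
    values: Dict[str, Optional[str]] = {}
    for label, relative_path in KNOBS.items():
        try:
            values[label] = (Path(sysfs_root) / relative_path).read_text().strip()
        except OSError:
            values[label] = None  # knob not present on this kernel or not permitted
    return values


if __name__ == "__main__":
    for label, value in read_knobs().items():
        print(f"{label}: {value if value is not None else 'not available'}")
```

Taking the root directory as a parameter keeps the function easy to try against a copy of the sysfs tree without touching a live system.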


NUMA constraints for power management

● On-chip memory controllers make each package a NUMA node
    ● Task consolidation may increase memory latency for tasks
● Constraints in package level power management
● Performance tradeoffs are very sensitive to workloads


Can virtualization save power?

● Virtualization can improve system utilization
    ● Operate system at better power efficiency
    ● VM guest configurations and resource allocations can be optimized
● Power saving optimizations within a guest are limited
    ● Hypervisor needs to coordinate policies across guests


Idle virtual machines

KVM host:

PowerTOP 1.10 (C) 2007, 2008 Intel Corporation
Collecting data for 15 seconds

Cn                Avg residency
C0 (cpu running)        (48.2%)
C0                0.0ms ( 0.0%)
C1 halt           0.0ms ( 0.0%)
C2                0.1ms ( 0.5%)
C3                0.3ms (51.3%)

P-states (frequencies)
2.17 Ghz    24.1%
1.67 Ghz     0.9%
1333 Mhz     6.1%
1000 Mhz    69.0%

Wakeups-from-idle per second : 1834.7   interval: 15.0s
no ACPI power usage estimate available

Top causes for wakeups:
  25.4% (1268.1)   <kernel IPI> : function call interrupts
  20.0% (1000.1)   kvm : __kvm_migrate_pit_timer (pit_timer_fn)
  19.8% (990.9)    kvm : __kvm_migrate_apic_timer (apic_timer_fn)
  15.1% (752.1)    <kernel IPI> : Rescheduling interrupts
   5.7% (285.1)    <interrupt> : PS/2 keyboard/mouse/touchpad
   3.7% (183.9)    <interrupt> : iwl3945
   3.6% (178.5)    firefox : futex_wait (hrtimer_wakeup)
   1.7% ( 86.1)    opera : schedule_timeout (process_timeout)

KVM guest:

PowerTOP 1.9 (C) 2007 Intel Corporation
Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 256.4   interval: 15.0s

Top causes for wakeups:
  97.5% (996.7)   <interrupt> : extra timer interrupt
   1.3% ( 12.9)   <interrupt> : ata_piix
   0.2% (  2.2)   <interrupt> : PS/2 keyboard/mouse/touchpad
   0.2% (  1.7)   gnome-terminal : schedule_timeout (process_timeout)
   0.1% (  1.1)   setroubleshootd : schedule_timeout (process_timeout)
   0.1% (  1.0)   im-info-daemon : do_nanosleep (hrtimer_wakeup)
   0.1% (  0.5)   <interrupt> : eth0
   0.1% (  0.5)   <kernel core> : e1000_intr (e1000_watchdog)
   0.1% (  0.5)   hald-addon-stor : schedule_timeout (process_timeout)
   0.1% (  0.5)   firefox : futex_wait (hrtimer_wakeup)


Power trending and capping tools: IBM Active Energy Manager


Power trending and capping tools: pwrkap

http://pwrkap.sourceforge.net/


Questions?

Acknowledgments

● Arun R Bhardwaj
● Darrick J Wong
● Gautham Shenoy
● Jeffery J Heroux
● Naren Devaiah
● Premalatha M Nair
● Susanne Libischer


Reference

● OLS 2008: Energy aware task and interrupt management
  http://ols.fedoraproject.org/OLS/Reprints-2008/srinivasan1-reprint.pdf
● OLS PM Mini summit
  http://lwn.net/Articles/292447/
● sched_mc=2 framework - 2.6.29
  http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=afb8a9b70b86866a60e08b2956ae4e1406390336
● sched_mc=2 for Nehalem
  http://lkml.org/lkml/2009/3/3/109
● Optimised Idle Load balancer
  http://lkml.org/lkml/2008/9/23/82
● Timer migration
  http://lwn.net/Articles/320152/, http://lkml.org/lkml/2009/3/4/130
● Pwrkap
  http://pwrkap.sourceforge.net/
● Active Energy Manager
  http://www-03.ibm.com/systems/power/hardware/whitepapers/energyscale.html

Thank You


Legal Statements

● Copyright International Business Machines Corporation 2009.
● Permission to redistribute in accordance with Linux Foundation Collaboration Summit submission guidelines is granted; all other rights reserved.
● This work represents the view of the authors and does not necessarily represent the view of IBM or Intel.
● IBM, IBM logo, ibm.com are trademarks of International Business Machines Corporation in the United States, other countries, or both.
● Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
● Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
● Other company, product, and service names may be trademarks or service marks of others.
● References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.


● INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.