© 2009 IBM Corporation
IBM Linux Technology Center
Tweaking Linux for a Green Datacenter
Vaidyanathan Srinivasan <[email protected]>
Jenifer Hopper <[email protected]>
Agenda
Platform features and Linux exploitation
Tuning scheduler and cpufreq
Saving power in an idle system
Saving power in an underutilized system
NUMA constraints for power management
Can virtualization save power?
Power trending and power capping
Energy Saving features in hardware
Dynamic frequency and voltage scaling
  Predefined set of frequency and corresponding voltage states
  Dual core and quad core CPUs may share power domains and clock distribution
Sleep states at idle
  Sleep states with low latency
  Deep sleep states (more power savings, higher latencies)
  Choice of sleep states (wakeup latency vs power savings)
  Package level sleep states
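The trade-off between wakeup latency and power savings can be sketched as a small selection function. This is only an illustration of the idea, not the kernel's cpuidle governor: the `SleepState` type, the `pick_sleep_state` helper, and all latency and power numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SleepState:
    name: str
    exit_latency_us: int  # time needed to wake up from this state
    power_mw: int         # power drawn while resident in this state


def pick_sleep_state(states: List[SleepState], max_latency_us: int) -> SleepState:
    """Return the lowest-power state whose wakeup latency fits the bound."""
    allowed = [s for s in states if s.exit_latency_us <= max_latency_us]
    if not allowed:
        raise ValueError("no sleep state satisfies the latency bound")
    return min(allowed, key=lambda s: s.power_mw)


# Made-up numbers, for illustration only.
STATES = [
    SleepState("C1", exit_latency_us=1, power_mw=1000),
    SleepState("C2", exit_latency_us=20, power_mw=500),
    SleepState("C3", exit_latency_us=100, power_mw=100),
]
```

With these made-up values, a 50 microsecond latency bound rules out C3 and selects C2, while a relaxed bound lets the CPU use the deepest state.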
Energy Saving features in Linux kernel
OnDemand frequency scaling (cpufreq)
Tickless kernel (NO_HZ)
Deferrable timers (for kernel code)
Scheduler domains, CPU affinity, cpusets
Multi core power saving heuristics
  sched_mc_power_savings
Support for deep sleep states
  Latency based selection of various low power deep sleep states (cpuidle governor)
Device power management infrastructure (USB,PCI,...)
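Several of these features expose their current setting through sysfs. The sketch below reads a few of them, assuming the usual layout under `/sys/devices/system/cpu`; the `read_knob` and `power_settings` helpers are hypothetical names. The `sched_mc_power_savings` file existed on kernels of the era these slides describe and may be absent on other kernels, so a missing file is reported as `None`.

```python
from pathlib import Path
from typing import Dict, Optional

SYSFS_CPU = Path("/sys/devices/system/cpu")


def read_knob(path: Path) -> Optional[str]:
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def power_settings(cpu_root: Path = SYSFS_CPU) -> Dict[str, Optional[str]]:
    """Collect a few power-related settings for cpu0 and the scheduler."""
    cpufreq = cpu_root / "cpu0" / "cpufreq"
    return {
        "cpu0_governor": read_knob(cpufreq / "scaling_governor"),
        "cpu0_cur_freq_khz": read_knob(cpufreq / "scaling_cur_freq"),
        "sched_mc_power_savings": read_knob(cpu_root / "sched_mc_power_savings"),
    }
```

Calling `power_settings()` with no argument inspects the running system; passing another root directory makes the function easy to exercise without real hardware.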
Idle CPU Power Management
[Diagram: processes P0 to P4 on CPU 0 to CPU 3, before and after consolidation]
Low system utilization: consolidate workloads
Move processes, timers and interrupts
Tickless kernel helps idle CPUs sleep longer
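Workload consolidation can be illustrated with a toy packing routine. This is a first-fit-decreasing sketch under made-up per-task loads, not the scheduler's actual heuristic; the `consolidate` function and the load values are hypothetical.

```python
from typing import Dict, List


def consolidate(task_load: Dict[str, float], n_cpus: int,
                capacity: float = 1.0) -> List[List[str]]:
    """Pack tasks onto as few CPUs as possible (first fit, largest first)."""
    cpus: List[List[str]] = [[] for _ in range(n_cpus)]
    used = [0.0] * n_cpus
    for task, load in sorted(task_load.items(), key=lambda kv: -kv[1]):
        for i in range(n_cpus):
            # Small tolerance so float rounding does not reject an exact fit.
            if used[i] + load <= capacity + 1e-9:
                cpus[i].append(task)
                used[i] += load
                break
        else:
            raise ValueError(f"no CPU has room for {task}")
    return cpus


# Made-up utilization figures (fraction of one CPU) for illustration only.
LOADS = {"P0": 0.5, "P1": 0.4, "P2": 0.3, "P3": 0.2, "P4": 0.1}
placement = consolidate(LOADS, n_cpus=4)
idle_cpus = sum(1 for tasks in placement if not tasks)
```

With these loads the five tasks fit on two CPUs, leaving the other two free to enter a sleep state.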
Power saving enhancements
[Figure: two bar charts, "Power savings at Idle" and "Power savings at 50% load". X axis: Normalised Average Power (scale 82 to 100). Y axis: Feature, with the following configurations]
No Powersaving
Sleep state
Dynamic Voltage and Frequency Scaling (DVFS) + Sleep
DVFS + Sleep + Tickless
DVFS + Sleep + Tickless + sched_mc=1
DVFS + Sleep + Tickless + sched_mc=2
Across the stack: hardware, firmware and Linux kernel
Approximate power savings percentages obtained across different experiments and hardware platforms
OnDemand CPU Frequency switching
CPU Task consolidation -- Kernbench
CPU Task consolidation -- SPECPower
CPU Task consolidation – Use sibling threads
Saving power in Idle system
Optimize applications to reduce wake-ups at idle
Increase low power sleep state residency
PowerTOP 1.11 (C) 2007, 2008 Intel Corporation
  Cn                Avg residency
  C0 (cpu running)        ( 0.8%)
  polling           0.0ms ( 0.0%)
  C1 mwait         13.3ms ( 1.1%)
  C3 mwait         46.1ms (98.0%)
  P-states (frequencies)
    2.93 Ghz     0.0%
    2.80 Ghz     0.0%
    2.67 Ghz     0.0%
    2.53 Ghz     0.0%
    1.60 Ghz   100.0%
  Wakeups-from-idle per second : 22.1 interval: 235.0s
  no ACPI power usage estimate available
  Top causes for wakeups:
    49.0% ( 67.4) <interrupt> : extra timer interrupt
    20.7% ( 28.5) <kernel IPI> : Rescheduling interrupts
     6.5% (  8.9) java : sk_reset_timer (tcp_write_timer)
     5.8% (  8.0) <kernel core> : cpucache_init (delayed_work_timer_fn)
     4.3% (  5.9) java : sk_reset_timer (tcp_delack_timer)
     3.6% (  5.0) java : futex_wait (hrtimer_wakeup)
     3.3% (  4.6) <interrupt> : eth0
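When tracking wakeup sources over time, it can help to turn the "Top causes for wakeups" lines into structured records. The sketch below parses lines in the format shown in this PowerTOP output; the `WakeupCause` type and `parse_wakeups` function are hypothetical helpers, and the regular expression is only an assumption about the layout seen here, not a PowerTOP interface.

```python
import re
from typing import List, NamedTuple


class WakeupCause(NamedTuple):
    percent: float      # share of all wakeups
    per_second: float   # value shown in parentheses
    source: str         # process name or <interrupt>, <kernel IPI>, ...
    detail: str         # function or device description


_LINE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)%\s*\(\s*(\d+(?:\.\d+)?)\)\s*(.+?)\s+:\s+(.+?)\s*$"
)


def parse_wakeups(report: str) -> List[WakeupCause]:
    """Extract wakeup causes from PowerTOP text; other lines are ignored."""
    causes = []
    for line in report.splitlines():
        m = _LINE.match(line)
        if m:
            causes.append(
                WakeupCause(float(m.group(1)), float(m.group(2)),
                            m.group(3), m.group(4))
            )
    return causes


SAMPLE = """\
Top causes for wakeups:
  49.0% ( 67.4) <interrupt> : extra timer interrupt
  20.7% ( 28.5) <kernel IPI> : Rescheduling interrupts
   6.5% (  8.9) java : sk_reset_timer (tcp_write_timer)
"""
```

Filtering the parsed records by `source` then shows which applications, such as the `java` process in this capture, are worth tuning first.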
Optimised idle load balancer
One CPU among the idle CPUs runs the sched tick and watches for overload on the busy CPUs.
Choosing this idle load balancer from a semi-idle CPU package will allow other CPU packages to go completely idle
[Diagram: Core 0 to Core 7, drawn in pairs. Legend: busy CPU running a task; idle CPUs in deep sleep; idle load balancer running load balance and sched tick]
Move the idle load balancer to the semi-idle CPU package
Package deep sleep state (zzZ)
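The selection rule can be sketched as follows. This is an illustrative model only, not the kernel implementation; the `pick_idle_load_balancer` function and the two-package, eight-core layout are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def pick_idle_load_balancer(packages: Dict[int, List[int]],
                            busy: Set[int]) -> Optional[int]:
    """Prefer an idle CPU from a package that already has a busy CPU.

    Falls back to any idle CPU, and returns None if no CPU is idle.
    """
    fallback = None
    for pkg in sorted(packages):
        cpus = packages[pkg]
        idle = [c for c in cpus if c not in busy]
        if not idle:
            continue  # fully busy package, nothing to pick here
        if len(idle) < len(cpus):
            return idle[0]  # semi-idle package: keeps other packages asleep
        if fallback is None:
            fallback = idle[0]
    return fallback


# Assumed layout: two packages of four cores each.
PACKAGES = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

With only core 5 busy, the function picks core 4, so package 0 has no periodic work and can stay in its package-level sleep state.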
Timer migration
Tasks can be consolidated using the sched_mc framework
Interrupts can be consolidated using the user space irqbalance daemon
Migrating timers from idle CPUs to the idle-load-balancer CPU, coupled with optimized selection of the idle-load-balancer CPU, provides good results for package evacuation and increased deep sleep residency time
Consolidating timers onto a single CPU enables good overlap of timer expiry times and reduces the total number of wakeups from idle
Range timer framework (from Arjan) will further help reduce wakeups from the idle deep sleep state
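The effect of overlapping expiry times can be shown with a toy model. It assumes that a CPU wakes once for each distinct expiry tick among its timers; the `count_wakeups` and `migrate_timers` helpers and the expiry values are hypothetical, and real kernel behaviour depends on many more factors.

```python
from typing import Dict, List


def count_wakeups(timers_by_cpu: Dict[int, List[int]]) -> int:
    """Count one wakeup per CPU for each distinct expiry tick it handles."""
    return sum(len(set(expiries)) for expiries in timers_by_cpu.values())


def migrate_timers(timers_by_cpu: Dict[int, List[int]],
                   target_cpu: int) -> Dict[int, List[int]]:
    """Move every timer onto target_cpu, leaving the other CPUs timer-free."""
    merged = [t for expiries in timers_by_cpu.values() for t in expiries]
    return {target_cpu: merged}


# Made-up expiry ticks for timers queued on four CPUs.
TIMERS = {0: [10, 20], 1: [10, 30], 2: [20, 30], 3: [10]}
before = count_wakeups(TIMERS)
after = count_wakeups(migrate_timers(TIMERS, target_cpu=0))
```

In this example seven wakeups spread over four CPUs collapse into three wakeups on one CPU, because timers that expire on the same tick are handled together.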
Saving power in underutilized system
Understanding workloads:
  Number of software threads and their CPU utilization
  Relation between threads: amount of data sharing
  Latency sensitive vs throughput oriented workloads
Knobs available:
  CPU frequency governors and policies
  Scheduler tunables, sched_mc_power_savings and sched_smt_power_savings
  CPU idle governor and PM QoS framework
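One user-space entry point to the PM QoS framework is the `/dev/cpu_dma_latency` device: a process writes a 32-bit latency bound in microseconds, and the request stays in force for as long as the file descriptor remains open. The sketch below shows that pattern; the `request_max_latency` helper is a hypothetical name, and writing to the real device normally requires root privileges.

```python
import os
import struct


def request_max_latency(latency_us: int,
                        device: str = "/dev/cpu_dma_latency") -> int:
    """Write a latency bound to the PM QoS device and return the open fd.

    The caller must keep the descriptor open while the bound is needed
    and close it with os.close() to drop the request.
    """
    fd = os.open(device, os.O_WRONLY)
    try:
        # Signed 32-bit integer in native byte order.
        os.write(fd, struct.pack("=i", latency_us))
    except OSError:
        os.close(fd)
        raise
    return fd
```

A latency sensitive service could call `request_max_latency(20)` at startup and close the descriptor at shutdown, which lets the idle governor use deep sleep states again once the service is gone.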
NUMA Constraints for power management
On-chip memory controllers make each package a NUMA node
Task consolidation may increase memory latency for tasks
Constraints in package level power management
Performance tradeoffs are very sensitive to workloads
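Before consolidating tasks it is useful to know which CPUs belong to which NUMA node. The sketch below reads that mapping, assuming the standard sysfs layout under `/sys/devices/system/node`; the `parse_cpulist` and `numa_nodes` helpers are hypothetical names.

```python
from pathlib import Path
from typing import Dict, List


def parse_cpulist(text: str) -> List[int]:
    """Expand a sysfs CPU list such as '0-3,8-11' into CPU numbers."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # node without CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(
    node_root: Path = Path("/sys/devices/system/node"),
) -> Dict[int, List[int]]:
    """Map each NUMA node number to the CPUs it contains."""
    nodes: Dict[int, List[int]] = {}
    for node_dir in sorted(node_root.glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        nodes[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return nodes
```

Comparing this mapping with the CPUs a workload is packed onto indicates whether consolidation keeps a task next to its memory or forces remote accesses.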
Can Virtualization save power?
Virtualization can improve system utilization
  Operate system at better power efficiency
  VM guest configurations and resource allocations can be optimized
Power saving optimizations within a guest are limited
  Hypervisor needs to coordinate policies across guests
Idle Virtual machines:
KVM host
  PowerTOP 1.10 (C) 2007, 2008 Intel Corporation
  Collecting data for 15 seconds
  Cn                Avg residency
  C0 (cpu running)        (48.2%)
  C0                0.0ms ( 0.0%)
  C1 halt           0.0ms ( 0.0%)
  C2                0.1ms ( 0.5%)
  C3                0.3ms (51.3%)
  P-states (frequencies)
    2.17 Ghz    24.1%
    1.67 Ghz     0.9%
    1333 Mhz     6.1%
    1000 Mhz    69.0%
  Wakeups-from-idle per second : 1834.7 interval: 15.0s
  no ACPI power usage estimate available
  Top causes for wakeups:
    25.4% (1268.1) <kernel IPI> : function call interrupts
    20.0% (1000.1) kvm : __kvm_migrate_pit_timer (pit_timer_fn)
    19.8% (990.9) kvm : __kvm_migrate_apic_timer (apic_timer_fn)
    15.1% (752.1) <kernel IPI> : Rescheduling interrupts
     5.7% (285.1) <interrupt> : PS/2 keyboard/mouse/touchpad
     3.7% (183.9) <interrupt> : iwl3945
     3.6% (178.5) firefox : futex_wait (hrtimer_wakeup)
     1.7% ( 86.1) opera : schedule_timeout (process_timeout)
KVM guest
  PowerTOP 1.9 (C) 2007 Intel Corporation
  Collecting data for 15 seconds
  < Detailed C-state information is only available on Mobile CPUs (laptops) >
  P-states (frequencies)
  Wakeups-from-idle per second : 256.4 interval: 15.0s
  Top causes for wakeups:
    97.5% (996.7) <interrupt> : extra timer interrupt
     1.3% ( 12.9) <interrupt> : ata_piix
     0.2% (  2.2) <interrupt> : PS/2 keyboard/mouse/touchpad
     0.2% (  1.7) gnome-terminal : schedule_timeout (process_timeout)
     0.1% (  1.1) setroubleshootd : schedule_timeout (process_timeout)
     0.1% (  1.0) im-info-daemon : do_nanosleep (hrtimer_wakeup)
     0.1% (  0.5) <interrupt> : eth0
     0.1% (  0.5) <kernel core> : e1000_intr (e1000_watchdog)
     0.1% (  0.5) hald-addon-stor : schedule_timeout (process_timeout)
     0.1% (  0.5) firefox : futex_wait (hrtimer_wakeup)
Power trending and capping tools: IBM Active Energy Manager
Power trending and capping tools: pwrkap
http://pwrkap.sourceforge.net/
Questions?
Acknowledgments
Arun R Bhardwaj
Darrick J Wong
Gautham Shenoy
Jeffery J Heroux
Naren Devaiah
Premalatha M Nair
Susanne Libischer
Reference
OLS 2008: Energy aware task and interrupt management
  http://ols.fedoraproject.org/OLS/Reprints-2008/srinivasan1-reprint.pdf
OLS PM Mini summit
  http://lwn.net/Articles/292447/
sched_mc=2 framework - 2.6.29
  http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=afb8a9b70b86866a60e08b2956ae4e1406390336
sched_mc=2 for Nehalem
  http://lkml.org/lkml/2009/3/3/109
Optimised Idle Load balancer
  http://lkml.org/lkml/2008/9/23/82
Timer migration
  http://lwn.net/Articles/320152/, http://lkml.org/lkml/2009/3/4/130
Pwrkap
  http://pwrkap.sourceforge.net/
Active Energy Manager
  http://www-03.ibm.com/systems/power/hardware/whitepapers/energyscale.html
Thank You
Legal Statements
● Copyright International Business Machines Corporation 2009.
● Permission to redistribute in accordance with Linux Foundation Collaboration Summit submission guidelines is granted; all other rights reserved.
● This work represents the view of the authors and does not necessarily represent the view of IBM or Intel.
● IBM, IBM logo, ibm.com are trademarks of International Business Machines Corporation in the United States, other countries, or both.
● Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
● Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
● Other company, product, and service names may be trademarks or service marks of others.
● References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.
Legal Statements
● INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.