EECS 750: Advanced Operating Systems
01/24/2014
Heechul Yun
Administrative
• Sign up for presentations
– Email me two papers that you want to present
– First-come, first-served
• The first paper reading begins next week
– Borrowed-Virtual-Time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler, SOSP’99
– Email me the summary by 11:59 p.m., Sunday
– Please include “[EECS750]” in the subject line
Today
• In-depth introduction of overall topics
• Project ideas & available resources
Topics
• Performance
– How to manage multicore CPUs, cache, DRAM, SSD, and GPU for good throughput, fairness, and QoS?
• Power/Energy
– How to save power/energy while still getting enough performance?
• Reliability
– How to make systems more predictable and less buggy?
– If you have bugs, how to find them automatically?
H. Sutter, “The Free Lunch Is Over”, Dr. Dobb's Journal, 2005 (updated in 2009)
Multicore
Server Desktop Mobile RT/Embedded
Multicore
• Lots of parallelism
• More performance at lower cost
NVIDIA Tegra K1 SoC: 4xCPU cores + 128 GPU cores + … (Source: nvidia.com)
Operating Systems Perspective
[Figure: Unicore: tasks T1 and T2 share one CPU. Multicore: tasks T1 to T8 are spread over Core1 to Core4.]
• Time-sharing
– Unicore: multiple tasks share a single processor
– Multicore: multiple tasks share multiple processors
Challenges: Shared Resources
[Figure: Unicore: tasks T1 and T2 share one CPU with its cache, DRAM, and disk. Multicore: tasks T1 to T8 are spread over Core1 to Core4, which share the cache, DRAM, and disk.]
Performance Impact
Data Intensive Applications
• Multimedia processing, object tracking, games, big data (*), …
• More stress on the memory hierarchy
(*) Source: Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012
Inter-Core Memory Interference
• Significant slowdown: >6x slowdown on 4 cores
[Figure: Slowdown ratio due to interference. Y-axis: runtime slowdown (0.00 to 7.00). X-axis: foreground benchmark. Setup: Intel Xeon with four cores sharing the L3 cache and memory; the foreground benchmark runs on one core and the background 470.lbm on the other cores.]
Inter-Core Memory Interference
• Performance depends on both cores and co-runners
L. Tang et al., “The Impact of Memory Subsystem Resource Sharing on Datacenter Applications”, ISCA’11
Questions
• How to maximize overall throughput?
• How to provide QoS (Quality-of-Service)?
• How to guarantee performance?
• Time-sharing based scheduling is not sufficient in this multi-core era
• Need to deal with parallelism and shared resources
In This Course
• CPU scheduling (Week 2)
• Contention-aware scheduling (Week 3-4)
• Cache and DRAM management (Week 5-6)
• Shared Disk and GPU management (Week 7-8)
Power/Energy
• Mobile
– Battery powered devices
• Data center
– 61 billion kWh in 2006 (1.5% of total U.S. electricity) (*)
(*) Source: EPA Report to Congress on Server and Data Center Energy Efficiency, 2007
Data Center Operating Cost Breakdown
Source: J. Koomey et al., “Assessing Trends over Time in Performance, Costs, and Energy Use for Servers”, 2009
Over 40% on energy
Power Consumption
• CPU and memory consume significant power
– Intel Core i7 Haswell CPU: 15W, 4G DDR3 DRAM module x 2: 10W
– Calxeda EnergyCore (ARM server) CPU: 5W
Figure source: Luiz André Barroso and Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool, 2009
Energy Reduction
• Hardware
– Process technology improvement
• 28nm → 20nm
– Clock gating, power gating
– …
• Software
– DVFS (Dynamic voltage and frequency scaling)
– DPM (Dynamic power management)
– …
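As a rough illustration of why DVFS saves energy, the standard CMOS approximation for dynamic power can be used (a textbook relation, not taken from the slides; the symbol C_eff, an effective switched capacitance, is introduced only for this sketch):

\[ P_{\mathrm{dyn}} \approx C_{\mathrm{eff}} \, V^{2} f \]

Because a lower clock frequency f usually permits a lower supply voltage V, scaling both together reduces power faster than it reduces speed.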
Example of Previous Work (CPU and Memory DVFS)
Memxfer5b: memory benchmark program

CPU (MHz)   Mem (MHz)   Time (s)   Energy (mJ)
200         100         3.46       1690
100         100         3.55       1182

• CPU clock halved
• Energy saved: 30%
• Execution time increased by only 3%
Motivation
Dhrystone: CPU benchmark program

CPU (MHz)   Mem (MHz)   Time (s)   Energy (mJ)
200         100         4.26       2364
200         50          4.28       2106

• Memory clock halved
• Energy saved: 10%
• Execution time increased by only about 0.5%
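The savings quoted for the two benchmarks can be recomputed from the measured rows. The snippet below is a minimal sketch that only uses the time and energy values listed for Memxfer5b and Dhrystone; the helper name `percent_change` is an illustrative choice, not part of any tool used in the original work.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Rows are (CPU MHz, memory MHz, time in s, energy in mJ), copied from the slides.
memxfer5b = [(200, 100, 3.46, 1690), (100, 100, 3.55, 1182)]
dhrystone = [(200, 100, 4.26, 2364), (200, 50, 4.28, 2106)]

for name, (base, scaled) in (("Memxfer5b", memxfer5b), ("Dhrystone", dhrystone)):
    time_up = percent_change(base[2], scaled[2])        # increase in execution time
    energy_down = -percent_change(base[3], scaled[3])   # decrease in energy
    print(f"{name}: time +{time_up:.2f}%, energy -{energy_down:.2f}%")
```

The point of both rows is the same: slowing down the component that is not the bottleneck costs little time but saves a noticeable amount of energy.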
Energy Equation and Validation
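\[ E = \frac{C}{f_c}\left(k_{ca} V_c^{2} f_c + k_{ms}^{*} V_m^{2} f_m + R\right) + \frac{M}{f_m}\left(k_{cs} V_c^{2} f_c + k_{ma}^{*} V_m^{2} f_m + R\right) + (I + R)(P - e) \]

Obtained coefficients in the energy equation:

Capacitance (nF)                    Power (mW)
Kca     Kcs     Kma*    Kms*        I       R
0.505   0.224   0.540   0.210       6.570   67.434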
• Validated on an ARM926EJ-S based platform
Energy vs. Utilization
[Figure: Normalized average power consumption (y-axis, 0.5 to 1) vs. utilization (x-axis, 0.1 to 0.9) for MAX, CPU-only, Static, and MultiDVFS. Task set cache stall ratio (MH/(CH+MH)): 0.3]
In This Course
• How to save power/energy? (Week 10)
– Mobile device level
– Server cluster level
Reliability
• Multi-threaded applications
– Hard to program
– Easy to produce subtle bugs
– Stress testing is ineffective and time-consuming
– Costs lots of $$$
State Space Explosion
• Example
Initially: V1 = V2 = … = V10 = 0
Thread 1        Thread 2        …    Thread 10
1: V1 = 1       1: V2 = 1            1: V10 = 1
2: stop         2: stop              2: stop
3,628,800 (10!) interleavings, 1024 (2^10) states
Exhaustive testing is impractical
(*) Slide from Godefroid’s talk in 2004 at PASTE with minor modification for our purpose
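The counts in this example can be checked by brute force for small thread counts. The following is a minimal sketch, assuming each thread performs exactly one atomic write as in the example; the function name `explore` is hypothetical. It enumerates every ordering of the writes and collects the distinct variable states seen along the way.

```python
from itertools import permutations
from math import factorial

def explore(n):
    """Enumerate all interleavings of n threads that each set one variable to 1.

    Returns (number of interleavings, number of distinct states visited).
    """
    interleavings = 0
    states = set()
    for order in permutations(range(n)):
        interleavings += 1
        values = [0] * n
        states.add(tuple(values))      # initial state: all variables are 0
        for thread in order:
            values[thread] = 1         # thread executes its single write
            states.add(tuple(values))
    return interleavings, len(states)

for n in range(1, 7):
    runs, states = explore(n)
    # n! interleavings but only 2^n distinct states
    assert runs == factorial(n) and states == 2 ** n
    print(n, runs, states)
```

The gap between the two numbers is what systematic testing and model checking tools try to exploit: many interleavings lead through the same states and need not all be executed.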
In This Course
• How to find bugs? (Week 11)
– Data race detection, atomicity violation detection, systematic testing and model checking
• How to prevent bugs? (Week 12)
– Deterministic runtimes
Project
• Available Resources
– BeagleBone Black x1, Odroid-XU-E x1, Samsung ARM Chromebook x1, Nexus 7 (1st gen) x1
– I can buy up to $300 of equipment for your project
Project Ideas
• DRAM Bank-aware user-level malloc library
– Goal: control memory allocations over DRAM banks
• Can save energy by packing all your allocations into one bank
• Can reduce worst-case latency by assigning dedicated banks to latency-critical applications
– Read
• ULCC: A User-Level Facility for Optimizing Shared Cache Performance on Multicores
• PALLOC, Jantz’s VEE paper
– I can provide simple malloc source code to start from
Project Ideas
• Page recoloring kernel patch (& tool)
– Re-map already allocated pages of certain colors (or DRAM banks) to different ones
– Read
• Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems
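To make the notion of a page "color" concrete, here is a minimal sketch of the usual textbook computation. It assumes a physically indexed cache whose set index is taken directly from the physical address bits (no hashing or slicing), 4 KiB pages, and a hypothetical 2 MiB, 16-way cache; the function names are illustrative and do not come from any kernel API.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB pages

def num_colors(cache_size, associativity, page_size=PAGE_SIZE):
    """Number of page colors: cache bytes per way divided by the page size."""
    return cache_size // (associativity * page_size)

def page_color(phys_addr, colors, page_size=PAGE_SIZE):
    """Color of the physical page that contains phys_addr."""
    return (phys_addr // page_size) % colors

# Hypothetical cache geometry, chosen only for illustration.
colors = num_colors(cache_size=2 * 1024 * 1024, associativity=16)
print(colors)                           # number of colors for this geometry
print(page_color(0x12345000, colors))   # color of one example physical page
```

Pages with the same color compete for the same cache sets, so recoloring amounts to copying a page into a physical frame whose frame number gives a different result under this mapping.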
Project Ideas
• Comprehensive power/performance analysis
– Investigate the performance impact of CPU DVFS, memory DVFS, and DPM on ARM development boards or recent Intel CPUs with energy counters
– Develop a better measurement tool
– Read
• An undergraduate individual study
• http://web.eece.maine.edu/~vweaver/projects/rapl/
• MultiDVFS paper (ECRTS'10)