Upload
vudat
View
219
Download
0
Embed Size (px)
Citation preview
Accurate emulation of CPU performance
Tomasz Buchert1 Lucas Nussbaum2 Jens Gustedt1
1 INRIA Nancy – Grand Est
2 LORIA / Nancy - Universite
Validation of distributed systems
Approaches:Theoretical approach (paper and pencil)
, the most general results and understanding/ very hard (leads to unsolvability results)
Experimentation (real application on a real environment), realistic context, credibility/ difficulty of preparation and control, questionable reproducibility
Simulation (modeled application inside modeled environment), very simple and perfectly reproducible/ experimental bias, possibly unrealistic
Emulation (real application inside a modeled environment), control over the experiment parameters/ difficult
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 2 / 20
Emulation
The perfect emulated environment should emulate (independently):
Network bandwidth, latency, topologyPerformance and number of CPUsMemory capabilitiesBackground noise (network, CPU, faults)
Already implemented in Wrekavoc – a tool to define and controlheterogeneity of the cluster (but not perfect yet!)
In this talk, however, we specifically concentrate on
Emulation of CPU
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 3 / 20
Emulation
The perfect emulated environment should emulate (independently):
Network bandwidth, latency, topologyPerformance and number of CPUsMemory capabilitiesBackground noise (network, CPU, faults)
Already implemented in Wrekavoc – a tool to define and controlheterogeneity of the cluster (but not perfect yet!)
In this talk, however, we specifically concentrate on
Emulation of CPU
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 3 / 20
CPU emulation
Various elements of CPU architecture could be emulated:
speednumber of coressizes and properties of caches (and topology thereof)memory access speed (especially for NUMA systems)
In this talk, we will talk about
Degradation of CPU speed
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 4 / 20
CPU emulation
Various elements of CPU architecture could be emulated:
speednumber of coressizes and properties of caches (and topology thereof)memory access speed (especially for NUMA systems)
In this talk, we will talk about
Degradation of CPU speed
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 4 / 20
An example
50 %
Unused
CPU 1
50 %
Unused
CPU 2
70 %
Unused
CPU 3
30 %
Unused
CPU 4
(1) controlling speed of each CPU/core independently
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 5 / 20
An example (continued)
50 %
Unused
CPU 1
50 %
Unused
CPU 2
70 %
Unused
CPU 3
30 %
Unused
CPU 4
(2) being able to create separated scheduling zones
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 6 / 20
Dynamic frequency scaling (CPU-Freq)
AKA Intel Enhanced SpeedStep or AMD Cool’n’QuietHardware solution to reduce:
heatnoisepower usage
For:no overhead of emulationcompletely unintrusivemeaningful CPU time measure
Against:only a finite set of different frequency levels
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 7 / 20
CPU-Lim
Method available in WrekavocAlgorithm:
if CPU usage ≥ threshold → send SIGSTOP to the processif CPU usage < threshold → send SIGCONT to the process
CPU usage = CPU time of the processprocess lifetime
For:easy and almost POSIX-compliant
Against:intrusive and unscalabledecision based on one process instead of global CPU usagesleeping is indistinguishable from preemption
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 8 / 20
Fracas
Based on idea from KRASH (load injection tool) ideaUses Linux Cgroups and Completely Fair SchedulerA predefined portion of the CPU is given to tasks burning CPUAll other processes are given the remaining CPU time
Emulatedprocesses
CPU burner
Core 1
Emulatedprocesses
CPU burner
Core 2
Emulatedprocesses
CPU burner
Core 3
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 9 / 20
Fracas
Based on idea from KRASH (load injection tool) ideaUses Linux Cgroups and Completely Fair SchedulerA predefined portion of the CPU is given to tasks burning CPUAll other processes are given the remaining CPU time
For:unintrusivescalable
Against:unportable to other systemssensitive to the configuration of the scheduler
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 9 / 20
Fracas and latency of the scheduler
1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]
3.03.54.04.55.05.56.06.57.0
GFL
OP
/ s
0.1 ms1 ms10 ms100 ms1000 ms
The smaller the latency, the better the emulation
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 10 / 20
Evaluation
Based on different types of work:CPU intensive (Linpack benchmark)IO boundmultiprocessingmultithreadingmemory speed (STREAM benchmark)
X-axis – emulated frequencyY-axis – speed perceived by the benchmarkeach test repeated 10 times, results = average95% confidence interval using t-Student distributionEvaluation performed on Grid’5000 platform
nodes with two quad-core Intel Xeon X5570 processorsnodes with a pair of single-core AMD Opteron 252 processors
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 11 / 20
Grid’5000
9 sites, 1600 machinesLille, Rennes, Orsay, Nancy, Bordeaux, Lyon, Grenoble, Toulouse, Sophia
Dedicated to research on distributed systems and HPC
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 12 / 20
CPU intensive work
1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]
3.03.54.04.55.05.56.06.5
GFL
OP
/ s
CPU-FreqCPU-Lim1Fracas
CPU-Lim is less predictable (the outcome has higher variance)
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 13 / 20
IO-bound work
1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]
3500
4000
4500
5000
5500
6000Lo
ops
/ s
CPU-FreqCPU-Lim1Fracas
CPU-Lim gives (unfair) advantage to IO-bound tasks
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 14 / 20
Multiprocessing
1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]
2000
4000
6000
8000
10000Lo
ops
/ sCPU-FreqCPU-Lim1Fracas
Fracas can’t emulate CPU for multitask computation
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 15 / 20
Multithreading
1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]
2000
4000
6000
8000
10000Lo
ops
/ sCPU-FreqCPU-Lim1Fracas
CPU-Lim controls processes instead of scheduling entities
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 16 / 20
Memory speed
1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0CPU Frequency [GHz]
5000
6000
7000
8000
9000
10000
11000G
B / s
CPU-FreqCPU-Lim1Fracas
Memory speed is affected differently by each method
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 17 / 20
Summary of the evaluation
CPU-Freq:very good resultscoarse granularity
CPU-Lim:not scalable due to implementation, intrusivehigher variancecontrols processes, not threads
Fracas:good behavior for a single-task workloadscalablebad behavior for multitask workload
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 18 / 20
Future work
Explore other approachesImprove Fracas to cover multitaskingEmulate memory bandwidthEmulate other aspects of CPUIntegrate Fracas into WrekavocTake over the world :)
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 19 / 20
Conclusions
Presented Fracas, a method for CPU performance emulationbased on Linux cgroupsCompared with CPU-Freq and CPU-Lim (Wrekavoc)Evaluated experimentally on Grid’5000None of the methods is perfect:
CPU-Freq: coarse grainedCPU-Lim: implementation problems, not scalableFracas: works perfectly in single thread/process case, needs workin multithread/process case
Questions?
Tomasz Buchert, Lucas Nussbaum and Jens Gustedt Accurate emulation of CPU performance 20 / 20