Mbench: Benchmarking a
Multicore Operating System
Using Mixed Workloads
Gang Lu and Xinlong Lin
Institute of Computing Technology,
Chinese Academy of Sciences
BPOE-6, Sep 4, 2015
Background

Fast evolution of hardware
Intel Xeon E5-4669 v3: 18 cores
Intel Xeon E5-4667 v3: 16 cores
With 4 NUMA nodes: 72 and 64 cores, 144 and 128 hardware threads
Booming big data applications
More cores can accommodate more applications
Motivation

"In the OS community, the lack of a representative 'desktop mix' benchmark prompted a call for better multi-core benchmarks, and in a similar vein, we believe the cloud computing community needs representative cluster benchmarks." ---The seven deadly sins of cloud computing research*

Not only in cloud computing but also in general data centers, we need new multi-core OS benchmarks.

* Malte Schwarzkopf, Derek G. Murray, and Steven Hand. The seven deadly sins of cloud computing research. HotCloud 2012.
Current OS benchmark suites
Focus on performance or scalability
single core -> multi-core -> many core
Few used mixed workloads
What are OS researchers using?
What are OS researchers using?

1999, Cellular Disco: TPC-D + RayTrace
2003, Xen: OSDB + SPEC WEB99 + dd + fork bomb
2005, K42: SPEC SDET + streaming applications
2009, Helios: SAT solver + a disk indexer
2013, Tessellation: video player + Dropbox
Why use mixed workloads?

In industry, admins tend to consolidate workloads to improve resource utilization
Virtualization systems
Even monolithic OSes like Linux
The performance curve of a single workload can be largely impacted
Tail latency is quite sensitive
Why use mixed workloads?

The performance curve of a single workload can be largely affected

(memcached, PARSEC.streamcluster)
Ratios of the performance of co-locating workload A with B to the performance of a solo run of workload A, denoted as (A, B).
Slowdown for mixed MOSBENCH workloads*. (Background workload is gmake)
Ideal curve (running alone)

* Ihor Kuz, Zachary Anderson, Pravin Shinde, and Timothy Roscoe. Multicore OS benchmarks: we can do better. HotOS 2011.
Tail latency is sensitive

Average performance is not as sensitive as worst-case performance

(Search, bodytracker) and (bodytracker, streamcluster)
Ratios of the performance of co-locating workload A with B to the performance of a solo run of workload A, denoted as (A, B).
highly sensitive vs. not sensitive
LXC: Linux Containers
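The contrast between average and worst-case behavior can be made concrete with a small nearest-rank percentile computation; this is a generic sketch with made-up latency samples, not Mbench code:

```python
# Nearest-rank percentile: the "worst performance" on this slide is a high
# percentile of the latency distribution, which reacts to interference far
# more strongly than the mean does.
def percentile(samples, p):
    """p-th percentile (0 < p <= 100) of samples, by the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# 1.5% slow outliers barely move the mean but dominate the 99th percentile.
latencies_ms = [10.0] * 985 + [200.0] * 15
mean_ms = sum(latencies_ms) / len(latencies_ms)   # 12.85 ms
p99_ms = percentile(latencies_ms, 99)             # 200.0 ms
```

Under interference, a few slow requests like these leave the average almost unchanged while the 99th percentile jumps by an order of magnitude.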
Mbench

Goal: providing real mixed workloads
What properties?
Benchmark selection
Full coverage
Micro and application benchmarks
Latency-critical workloads
Minimal redundancy
Experiment control
Tunable workload composition
Performance monitoring and analysis
What do we include in Mbench?
Summary of the benchmarks
Construction of Mbench
Micro benchmarks
Key principles
Covering each subsystem
CPU, memory, file system, network
Covering kernel functionalities
system calls, in-kernel mechanisms
Construction of Mbench
Micro benchmarks
SPEC CPU
Construction of Mbench
Micro benchmarks
cachebench
Construction of Mbench
Micro benchmarks
IOzone
Construction of Mbench
Micro benchmarks
netperf
Construction of Mbench
Micro benchmarks
Will-It-Scale
Iterations of: brk, dup, eventfd, fallocate, futex, getppid, lock, lseek, ...
We extended it to export per-invocation latencies
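The latency extension can be sketched as follows; this is an illustrative Python re-creation (Will-It-Scale itself is written in C), and the function name is hypothetical:

```python
# Illustrative sketch of exporting per-invocation latencies instead of only
# an iterations-per-second count: time each call of a cheap system call.
import os
import time

def measure_syscall_latencies(iterations):
    """Time each getppid() call and return the raw latencies in nanoseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        os.getppid()  # the benchmarked kernel entry point
        latencies.append(time.perf_counter_ns() - start)
    return latencies

lats = measure_syscall_latencies(10_000)
lats.sort()
p99 = lats[int(0.99 * len(lats))]  # raw samples enable tail metrics, not just the mean
```

Keeping the raw samples is what makes tail-latency analysis possible; a pure throughput counter discards exactly the information the later slides care about.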
Construction of Mbench
Application benchmarks
Key principles
Real applications
Different workload types
Different resource priorities
BigDataBench is a good choice!
BigDataBench
With 33 workloads, different programming
models
Real-world big data
Lei Wang, Jianfeng Zhan, et al. BigDataBench: A big data benchmark suite from internet services. HPCA'14.
Construction of Mbench
Application benchmarks
Search
Front end: tomcat
Back end: nutch
Construction of Mbench
Application benchmarks
Hadoop
sort, grep
Spark
kmeans, pagerank
Construction of Mbench
Application benchmarks
PARSEC
small offline batches
Construction of Mbench
Application benchmarks
memcached
Derived from MOSBENCH*
Modified to report tail latency and to support different running models
* Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris,
and Nickolai Zeldovich. An analysis of Linux scalability to many cores. OSDI’10.
Construction of Mbench
Application benchmarks
PostgreSQL
pgbench
How to use?

Two example problems
What workload mixes yield the most information when used to evaluate a multicore OS?
How do services behave when running concurrently with other applications?
Use cases (1)
Micro benchmarks
Performance degradation of co-running two benchmarks on a single server. The numbers 0~10 on the axes denote SPECCPU.{bzip2, sphinx3}, cachebench.{read, write, modify}, IOzone.{write, read, modify}, and netperf.{tcp_stream, tcp_rr, tcp_crr}, respectively. The numbers in the grid cells are the performance degradation percentage (%) of the benchmark on the y-axis interfered with by the background benchmark on the x-axis.
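Each cell of this heat map can be reproduced from two measurements per pair; a minimal sketch with made-up throughput numbers, assuming a higher-is-better metric:

```python
# How each grid cell is derived: compare foreground benchmark A's
# performance co-run with background B against its solo run. The throughput
# figures below are invented for illustration, not measured data.
def degradation_pct(solo_perf, colocated_perf):
    """Performance loss as a percentage, for higher-is-better metrics
    (e.g. throughput); a latency metric would invert the ratio."""
    return (solo_perf - colocated_perf) / solo_perf * 100.0

solo_tps = 1000.0       # e.g. netperf tcp_rr transactions/s, running alone
colocated_tps = 620.0   # the same metric with a background IOzone.write
loss = degradation_pct(solo_tps, colocated_tps)  # ~38%
```

The same two-run procedure, repeated for every (foreground, background) pair, fills the full matrix shown on the slide.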
Use cases (1)
Micro benchmarks
A new system

Performance degradation of co-running two benchmarks on a single server. The numbers 0~10 on the axes denote SPECCPU.{bzip2, sphinx3}, cachebench.{read, write, modify}, IOzone.{write, read, modify}, and netperf.{tcp_stream, tcp_rr, tcp_crr}, respectively. The numbers in the grid cells are the performance degradation percentage (%) of the benchmark on the y-axis interfered with by the background benchmark on the x-axis.
Use cases (2)
Application benchmarks
Performance degradation of the tail latency of the Search workload co-located with the PARSEC benchmarks. Each time, we run an individual benchmark in the background. For the three kernels, the tail latencies of running Search at 300 req/s on a 12-core server are 108.6, 134.5, and 127.7 ms, respectively.
Use cases (2)
Application benchmarks
A new system

Performance degradation of the tail latency of the Search workload co-located with the PARSEC benchmarks. Each time, we run an individual benchmark in the background. For the three systems, the tail latencies of running Search at 300 req/s on a 12-core server are 129.3, 210.9, and 128.5 ms, respectively.
What else should we do?

Experiment control
Controlling mixed workloads is more difficult
selection, time synchronization
Resource allocation policies
CPU pinning, NUMA allocation
Performance monitoring
Monolithic kernel, virtualization, multi-kernel
System & architecture levels
Log collecting and analysis
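The CPU-pinning part of resource allocation can be sketched with facilities reachable from Python on Linux; `os.sched_setaffinity` handles the pinning, while NUMA placement would in practice be delegated to `numactl` (shown only as a comment, since the Python standard library has no NUMA API):

```python
# CPU pinning from Python (Linux only): restrict a process to a CPU set so
# co-located workloads do not share cores.
import os

def pin_to_cpus(pid, cpus):
    """Bind process `pid` (0 = the calling process) to the given CPU set."""
    os.sched_setaffinity(pid, cpus)

available = os.sched_getaffinity(0)  # CPUs this process may currently use
target = {min(available)}            # pin to the lowest-numbered core
pin_to_cpus(0, target)

# NUMA-node binding would typically shell out to numactl, e.g.:
#   numactl --cpunodebind=0 --membind=0 ./workload
```

Pinning the latency-critical workload and the background workload to disjoint core sets is what makes interference experiments like the ones above repeatable.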
An experiment control tool
Mcontroller
Structure of the experiment control tool (Mcontroller)
What can Mcontroller do?

Experiment control
Customizable control
workload selection, time series of each workload
batch running of multiple experiments
Resource allocation
CPU pinning, NUMA allocation
Performance monitoring
Supports Linux, Linux Containers, Xen
Nearly all system & architecture characteristics
proc, perf, OProfile
Interfaces are extensible!
Benchmarks can be easily added!
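A toy version of the time-series control listed above might look as follows; the commands and offsets are illustrative, and `run_mix` is a hypothetical name, not Mcontroller's actual interface:

```python
# Minimal workload-mix launcher: start each benchmark command at a fixed
# offset from a common origin, wait for all of them, return exit codes.
import subprocess
import threading
import time

def run_mix(plan):
    """plan: list of (start_offset_s, argv); returns exit codes in plan order."""
    t0 = time.monotonic()
    procs = [None] * len(plan)

    def launch(i, offset, argv):
        # Sleep until this workload's scheduled start time, then spawn it.
        time.sleep(max(0.0, offset - (time.monotonic() - t0)))
        procs[i] = subprocess.Popen(argv)

    threads = [threading.Thread(target=launch, args=(i, off, argv))
               for i, (off, argv) in enumerate(plan)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                      # all workloads have been launched
    return [p.wait() for p in procs]  # all workloads have finished

# Latency-critical foreground at t=0, batch background joining at t=1s.
codes = run_mix([(0.0, ["sleep", "1"]), (1.0, ["sleep", "1"])])
```

A real controller would additionally synchronize clocks across machines, collect per-workload logs, and drive batches of such plans, as the slide describes.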
Conclusions

Current OS benchmarks do not evaluate performance isolation!
OS researchers choose mixed workloads at will
Mbench: an OS benchmark suite with mixed workloads
micro & application benchmarks
latency-critical workloads
diverse resource & workload types
We developed an experiment control tool
It supports many useful features and is extensible
Thank you! Mahalo!