A High Performance C++ Generic Benchmark for Computational Epidemiology
Aniket Pugaonkar, Sandeep Gupta, Keith R. Bisset and Madhav V. Marathe {aniketnp, sandeep, kbisset, mmarathe}@vbi.vt.edu
The Network Dynamics and Simulation Science Laboratory, Virginia Tech, USA
HPL (the Top500 benchmark) is the most widely recognized and discussed metric for high performance computing systems. Other widely known benchmarks include NPB, HPCC, SPEC, and EuroBen.
Application-specific benchmarking is essential for two reasons: (a) to find a better correlation between an application and the machine running it, and (b) to help choose the most appropriate hardware-software configuration for given application and system parameters.
The Graph 500 [3] benchmark was developed for applications with graphs as their core analytical workloads.
The Boost Graph Library (BGL) [4] is built on the Boost C++ framework and follows generic programming principles. PBGL [6] is a distributed graph library built by lifting, i.e., providing distributed implementations of the interfaces and operators in BGL.
INTRODUCTION
CONTAGION MODEL
CHALLENGES AND GOALS
Kernel 0: Create Person-Location Activity List.
Kernel 1: Construct a Person-Location Graph.
Kernel 2: Construct a Person-Person Graph.
Kernel 3: Assign locations to location groups in Person-Location Graph.
Kernel 4: Run the activity based contagion process over Person-Location graph. The computational complexity of the kernel is comparable to that of EpiSimdemics Algorithm [1].
Kernel 5: Run the contact based contagion process over person-person graph. The computational complexity of this kernel is comparable to that of EpiFast Algorithm [2].
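As a rough illustration of Kernels 0 to 2, the sketch below groups an activity list by location and then links two persons whenever their visits to the same location overlap in time. The `Activity` layout, the `std::map`-based graph types, and the function names are assumptions made for this example; the poster does not specify the benchmark's data structures, and the actual kernels are written against generic graph interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One activity (hypothetical layout): a person visits a location during [start, end).
struct Activity {
    int person;
    int location;
    int start;
    int end;
};

// location -> visits to that location
using PersonLocationGraph = std::map<int, std::vector<Activity>>;
// (p, q) with p < q -> total time the two persons overlap
using PersonPersonGraph = std::map<std::pair<int, int>, int>;

// Kernel 1 (sketch): group the activity list by location.
PersonLocationGraph build_person_location(const std::vector<Activity>& activities) {
    PersonLocationGraph graph;
    for (const Activity& a : activities) {
        assert(a.start < a.end);  // every visit must have a positive duration
        graph[a.location].push_back(a);
    }
    return graph;
}

// Kernel 2 (sketch): connect two persons when their visits to the same
// location overlap in time; the edge weight accumulates the overlap.
PersonPersonGraph build_person_person(const PersonLocationGraph& pl) {
    PersonPersonGraph pp;
    for (const auto& entry : pl) {
        const std::vector<Activity>& visits = entry.second;
        for (std::size_t i = 0; i < visits.size(); ++i) {
            for (std::size_t j = i + 1; j < visits.size(); ++j) {
                const Activity& a = visits[i];
                const Activity& b = visits[j];
                if (a.person == b.person) continue;
                const int overlap = std::min(a.end, b.end) - std::max(a.start, b.start);
                if (overlap > 0) {
                    const std::pair<int, int> key(std::min(a.person, b.person),
                                                  std::max(a.person, b.person));
                    pp[key] += overlap;
                }
            }
        }
    }
    return pp;
}
```

The pairwise loop is quadratic in the number of visits per location, which is acceptable for a sketch but is exactly the kind of step where alternative kernel implementations could be swapped in.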
[1] Christopher L. Barrett, et al. EpiSimdemics: An efficient algorithm for simulating the spread of infectious disease over large realistic social networks. (SC 2008)
[2] Keith R. Bisset, et al. EpiFast: A fast algorithm for large scale realistic epidemic simulations on distributed memory systems. (ICS '09). ACM, New York, NY, USA, 430-439.
[3] Murphy, Richard C., et al. "Introducing the Graph 500." Cray User’s Group (2010).
[4] Lee, Lie-Quan, and Andrew Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley Professional, 2002.
[5] Chuck Pheatt. 2008. Intel® Threading Building Blocks. J. Comput. Sci. Coll. 23, 4 (April 2008), 298-298.
[6] Gregor, Douglas, et al. "The Parallel Boost Graph Library." The Trustees of Indiana University (2005).
[7] M. Heroux and J. Dongarra. Towards a New Metric for Ranking High Performance Computing Systems, UTK EECS and Sandia National Labs Report SAND2013-4744, June 2013.
We design a benchmark consisting of several kernels that capture the essential compute, communication, and data access patterns of high performance contagion-diffusion simulations used in computational networked epidemiology. The goal is to (a) derive alternative implementations for computing the contagion by combining different implementations of the kernels, and (b) evaluate which combination of implementation, runtime, and hardware is most effective in running large-scale contagion-diffusion simulations.
Our proposed benchmark is designed using C++ generic programming primitives and by lifting sequential strategies to parallel computations. Together these lead to a succinct description of the benchmark and significant code reuse when deriving strategies for new hardware. These aspects are crucial for an effective benchmark because the number of possible combinations of hardware and runtimes is growing rapidly, making it infeasible to write an optimized strategy for the complete contagion diffusion from the ground up for each compute system.
Overall Metric : Total Interactions Per Second
• The number of interactions for a given network is independent of disease parameters.
• $\mathrm{TIPS} = \dfrac{\text{Total \# interactions in the graph}}{\text{Total time required (using } i \text{ threads)}}$
Strong Scaling (Speedup) for kernels 4 and 5:
• $\mathrm{Speedup} = \dfrac{\text{Time to run contagion on the graph using 1 thread}}{\text{Time to run contagion on the graph using } n \text{ threads}}$
Overall time : Complete running time of benchmark
• The time to run the complete benchmark with a particular strategy for a particular hardware and runtime.
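The two ratio metrics above reduce to simple divisions once interaction counts and running times are measured. The helper names below are hypothetical, and any timings passed to them are the caller's own measurements; this is only a sketch of the arithmetic, not the benchmark's reporting code.

```cpp
#include <cassert>

// TIPS: total interactions in the graph divided by the total running
// time in seconds (measured with i threads).
double tips(double total_interactions, double seconds) {
    assert(seconds > 0.0);
    return total_interactions / seconds;
}

// Strong-scaling speedup: time with 1 thread divided by time with n threads
// on the same graph.
double speedup(double seconds_one_thread, double seconds_n_threads) {
    assert(seconds_n_threads > 0.0);
    return seconds_one_thread / seconds_n_threads;
}
```

For instance, a graph with 10 million interactions processed in a hypothetical 4 seconds would score 2.5 million TIPS. Because the interaction count depends only on the network, TIPS values for the same graph are comparable across disease parameters.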
PERFORMANCE
RELATED STUDY
CHALLENGES
Complex algorithms used in real applications cannot be used as benchmarks because of intricate application parameters.
Designing and implementing strategies for new hardware limits code reuse and generic programming.
GOALS
Design kernels which capture the essential computation, communication and data access patterns in tools used to simulate spread of infectious disease through contagion models.
Develop and implement different evaluation strategies for kernels on existing and emerging hardware.
Evaluate the most effective combination of implementation, runtime and hardware for a given contagion.
REFERENCES
BENCHMARK METRICS
HIGH LEVEL SPECS
TASK BASED PARALLELISM
COMPLEX INTERVENTIONS
KERNEL FLOW DIAGRAMS
A four-node, five-edge contact network with ΔtE = 0 and ΔtI = 2 (in days) for each node, and transmission probability 0.5 for each day. Node A is infectious at the start.
i. Day 0: A transmits the disease to B but not D.
ii. Day 1: Both A and B are infectious. A transmits the disease to D; B infects C but not D.
iii. Day 2: A is removed; all others are infectious, with no susceptible nodes.
iv. Nodes are removed gradually, and on day 4 the system enters a fixed point and stops evolving.
The state transitions are one-way (from susceptible to exposed to infected to recovered) with no other possible transitions.
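The day-by-day process in this example can be sketched as a discrete-time simulation over an edge list. The code below is a minimal illustration under stated assumptions, not the EpiFast or EpiSimdemics algorithm: the `Person` struct, the function names, and the rule that a newly infected node changes state at the end of the day are choices made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

enum class Health { Susceptible, Exposed, Infectious, Removed };

struct Person {
    Health state = Health::Susceptible;
    int days_left = 0;  // days remaining in the current Exposed/Infectious state
};

// Advance the contagion by one day over an undirected contact network.
// Returns true if any node is still Exposed or Infectious afterwards.
template <typename Rng>
bool step_day(std::vector<Person>& people,
              const std::vector<std::pair<int, int>>& edges,
              double p, int dt_exposed, int dt_infectious, Rng& rng) {
    assert(p >= 0.0 && p <= 1.0 && dt_exposed >= 0 && dt_infectious >= 1);
    std::bernoulli_distribution transmit(p);
    std::vector<bool> newly_infected(people.size(), false);

    // Transmission: each infectious endpoint may infect a susceptible neighbor.
    for (const auto& e : edges) {
        for (int dir = 0; dir < 2; ++dir) {
            const int u = dir == 0 ? e.first : e.second;
            const int v = dir == 0 ? e.second : e.first;
            if (people[u].state == Health::Infectious &&
                people[v].state == Health::Susceptible && transmit(rng)) {
                newly_infected[v] = true;
            }
        }
    }

    // Progression: one-way transitions S -> E -> I -> R.
    bool active = false;
    for (std::size_t i = 0; i < people.size(); ++i) {
        Person& q = people[i];
        switch (q.state) {
            case Health::Susceptible:
                if (newly_infected[i]) {
                    q.state = dt_exposed > 0 ? Health::Exposed : Health::Infectious;
                    q.days_left = dt_exposed > 0 ? dt_exposed : dt_infectious;
                }
                break;
            case Health::Exposed:
                if (--q.days_left <= 0) {
                    q.state = Health::Infectious;
                    q.days_left = dt_infectious;
                }
                break;
            case Health::Infectious:
                if (--q.days_left <= 0) q.state = Health::Removed;
                break;
            case Health::Removed:
                break;
        }
        if (q.state == Health::Exposed || q.state == Health::Infectious) active = true;
    }
    return active;
}

// Run until the system reaches a fixed point; returns the number of days simulated.
template <typename Rng>
int run_contagion(std::vector<Person>& people,
                  const std::vector<std::pair<int, int>>& edges,
                  double p, int dt_exposed, int dt_infectious, Rng& rng) {
    int days = 0;
    bool active = true;
    while (active) {
        active = step_day(people, edges, p, dt_exposed, dt_infectious, rng);
        ++days;
    }
    return days;
}
```

To mirror the example, seed node A as `Infectious` with `days_left = 2` and call `run_contagion` with `p = 0.5`, `dt_exposed = 0`, `dt_infectious = 2`, and a `std::mt19937`. The text names only four of the five edges (A-B, A-D, B-C, B-D), so the fifth edge must be supplied as an assumption. Outcomes depend on the random draws, so a particular run need not reproduce the day-by-day trace described above.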
Node Interventions (NI) – Alter Vertex Properties of Persons
• Antivirals, vaccinations, etc.
• Change the infectivity or susceptibility of the person (see fig below).
Edge Interventions (EI) – Alter Edge Properties (Activities)
• Location Closure – redirected to alternate locations.
• Activity Modification – can alter the duration of the activity or contact period.
Intervention Type is also a Benchmark Parameter
Contagion Model is a Benchmark Parameter
• SEIR, SIR, SIS, SEIS, etc.
ri = infectivity of person i
sj = susceptibility of person j
t = disease transmissibility
Pr(i->j) = probability of infection from i to j
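A small sketch may help show how node and edge interventions feed into Pr(i->j). The poster does not state the formula, so the form used here, Pr(i->j) = 1 - (1 - t * ri * sj)^duration, is an assumption chosen for illustration, as are the struct and function names.

```cpp
#include <cassert>
#include <cmath>

// Vertex properties of a person (hypothetical layout).
struct PersonProps {
    double infectivity = 1.0;     // r_i
    double susceptibility = 1.0;  // s_j
};

// Edge property of a contact or activity (hypothetical layout).
struct Contact {
    double duration = 1.0;  // contact period, in arbitrary time units
};

// ASSUMED form, not given in the source:
//   Pr(i->j) = 1 - (1 - t * r_i * s_j)^duration
double infection_probability(const PersonProps& i, const PersonProps& j,
                             const Contact& c, double t) {
    const double per_unit = t * i.infectivity * j.susceptibility;
    assert(per_unit >= 0.0 && per_unit <= 1.0);
    return 1.0 - std::pow(1.0 - per_unit, c.duration);
}

// Node intervention (NI): scale a vertex property of the person.
void scale_infectivity(PersonProps& p, double factor) { p.infectivity *= factor; }
void scale_susceptibility(PersonProps& p, double factor) { p.susceptibility *= factor; }

// Edge intervention (EI): activity modification scales the contact period.
void scale_contact(Contact& c, double factor) { c.duration *= factor; }
```

Under this assumed form, halving the infectivity of i or the susceptibility of j halves the per-unit-time transmission term, while shortening the contact reduces the exponent. Keeping interventions as plain property updates means the contagion kernels themselves do not need to change when the intervention type changes.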
CONCLUSIONS
Compute Platform | Shadowfax | BlueRidge
Processor | Xeon E7-4860 | Xeon E5-2670
L3 Cache | 24 MB | 20 MB
# Cores | 10 | 8
# Threads | 20 | 16
Clock | 2.26 GHz | 2.6 GHz
QPI Speed | 6.4 GT/s (1 link) | 8 GT/s (2 links)
# Sockets per node | 4 | 2
Total Cores | 40 (4x10) | 16 (2x8)
Person-Location Graph | Graph 1 | Graph 2 | Graph 3 | Graph 4
# Persons | 2^15 | 2^16 | 2^17 | 2^18
# Locations | 8169 | 16321 | 32339 | 64421
# Edges (activities) | 163464 | 327314 | 654690 | 1308386
# Interactions (millions) | 10 | 20.5 | 41.2 | 82.7
Person-Person Graph | Graph 1 | Graph 2 | Graph 3
# Persons | 2^17 | 2^18 | 2^19
# Edges (activities) | 2839404 | 5676741 | 11363809
# Interactions (millions) | 17 | 34.5 | 69.4
[Figure: Strong Scaling, Kernel 4 Performance. Speedup vs. number of threads for Graphs 1 to 4 on BlueRidge (1, 4, 6, 8, 12, 16 threads) and Shadowfax (1, 4, 6, 8, 12, 16, 24, 32 threads).]
[Figure: Strong Scaling, Kernel 5 Performance. Speedup vs. number of threads for Graphs 1 to 3 on BlueRidge (1, 4, 6, 8, 12, 16 threads) and Shadowfax (1, 4, 6, 8, 12, 16, 24, 32 threads).]
[Figure: Speedup for i threads vs. number of threads (6, 12, 16, 24) on Shadowfax and BlueRidge, for Graph 1 (2^15 persons), Graph 2 (2^16), Graph 3 (2^17), and Graph 4 (2^18).]
[Figure: Speedup for i threads vs. number of threads (6, 12, 16) on Shadowfax and BlueRidge, for graphs with 2^17, 2^18, and 2^19 persons.]
CONTRIBUTIONS
Develop kernel specifications and metrics for our benchmark.
Provide generic implementation of kernels in C++.
Provide generic kernels for agent based and contact based contagion models. The kernels capture the computational complexity (and not semantics) of algorithms [1],[2].
Develop scalable shared and distributed memory generic implementation of kernels using task based parallelism and message passing interface.
*Images adapted from [2]
Compute Platform – a stack of hardware, runtime and approach.
Contagion-diffusion – an agent based simulation.
Person – models the agent.
Location – spatial region.
Activity – a person visiting a location for a time period.
Interaction – two people at same location with overlap in time periods.
Contagion – spread of disease among agents.
State – health state of a person (susceptible, infected etc.)
Graph – data structure which encodes the relationships such as visits or interactions.
Intervention – alter the activity and state of a person.
CONCEPTS
Disease Parameters
Model | SEIR
Transmissibility | 0.00003
Incubation period | 2 days
Infectious period | 4 days
Interventions | NI, EI
Location Groups | 100
Infectivity | 1
Susceptibility | 1
Weak Scaling
In this work, we present a suite of kernels that together form a benchmark for contagion-diffusion simulation. We provide an encoding of the benchmark specifications using C++11 templates and iterators that is generic and composable, i.e., different implementations of kernels can be composed to arrive at alternative implementations of the benchmark.
The benchmark is used to evaluate the performance of two classes of machines based upon the TIPS metric. Preliminary results indicate that the BlueRidge system is more scalable than Shadowfax.
Ongoing work: Our current work is focused on two major aspects. (1) Standardization: developing code for our benchmark that is compatible with any standard graph library that implements its basic specifications. (2) Distributed/shared memory implementation: we aim to lift the sequential implementation to large-scale shared memory and distributed memory implementations without affecting the genericity and simplicity of the benchmark.
Serial Performance
Experimental Setup
Kernel 5
Kernel 4
Parallel Performance
[Figure: Serial performance of Kernel 2. Time in seconds vs. scale (in powers of 2: 17, 18, 19).]
Note:
The serial performance of Kernel 2 (Person-Person Graph) is shown in the graph above.
The performance of Kernels 0, 1, and 3 is not reported here because their running times are very low (and partly because of space restrictions).
The benchmark ranks different compute platforms based upon the TIPS metric.
[Figure labels: More Scalable, Less Scalable]
Compiler GCC 4.7.2
TBB 4.2
BOOST 1.55
[Figure: Node Interventions. Each Person (Vertex) carries Properties. Less Infectious: State: Infected with Infectivity 1 vs. Infectivity 0.5. Less Susceptible: State: Susceptible with Susceptibility 1 vs. Susceptibility 0.5.]
ACKNOWLEDGEMENT
We thank the members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their contributions, useful discussions, and comments.
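[Figure: Task pool. Location groups (LG1, LG2, …, LGn) and locations (L1, L2, …, Ln) yield tasks (T1, T2, …, Tn) that are executed by workers (W1, W2, …, Wi). A worker is a TBB thread or an MPI process.]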
TASK BASED WORK STEALING MECHANISM
Worker – Intel TBB Thread or MPI Process
Coarse Grained – Every worker handles a location group and creates location group tasks (T1, …, Tn)
Fine Grained – Every location group task further spawns sub-location tasks (T11, T12, …, Txx)
Stealing – Idle workers steal tasks from the task pool (T1, T2, …, T11, Txx)
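The fine-grained decomposition can be approximated with the C++11 standard library alone. The sketch below is a simplified stand-in rather than Intel TBB's work-stealing scheduler: it uses a single shared task pool from which idle `std::thread` workers claim the next task through an atomic counter, so load balancing comes from self-scheduling instead of per-worker deques. The function names and the `process_location` callback are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run every task in `pool` on `num_workers` threads. A worker that finishes
// early simply claims the next unclaimed task from the shared pool.
void run_task_pool(const std::vector<std::function<void()>>& pool,
                   unsigned num_workers) {
    assert(num_workers > 0);
    std::atomic<std::size_t> next(0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&pool, &next] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= pool.size()) break;  // pool exhausted
                pool[i]();
            }
        });
    }
    for (std::thread& t : workers) t.join();
}

// Fine-grained decomposition: one task per location in each location group.
// `process_location` is a hypothetical per-location kernel supplied by the
// caller; it must outlive the returned tasks and be safe to call concurrently.
std::vector<std::function<void()>> make_location_tasks(
    const std::vector<std::vector<int>>& location_groups,
    const std::function<void(int)>& process_location) {
    std::vector<std::function<void()>> tasks;
    for (const std::vector<int>& group : location_groups) {
        for (int location : group) {
            tasks.push_back([location, &process_location] {
                process_location(location);
            });
        }
    }
    return tasks;
}
```

A coarse-grained variant would create one task per location group instead of one per location, trading scheduling overhead for a higher risk of load imbalance when group sizes differ.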
[Figure: Fine-grained and coarse-grained task decomposition for shared memory and distributed memory. Shared memory: workers (W1, W2, …, Wi) in a worker pool take location group tasks and sub-location tasks (T11, T12, …) from the task pool. Distributed memory: location groups (LG1, LG2, …, LGn) are mapped to MPI processes; each MPI process spawns TBB threads and creates its local task pool, with one task per location (L1, L2, …, Ln). The example shows two locations for LG1.]