Upload
duongdiep
View
213
Download
0
Embed Size (px)
Citation preview
Turning Big data into knowledgeTechniques and Tools for Parallel Computing on Online Data Streams in Systems Biology and Epidemiology
IMPACT BioBITs
Marco AldinucciUniversity of Torino, Italy
Outline
• Stochastic Formal System Biology• From Distributed to Multicore and back• On programming models
• FastFlow
• The CWC parallel simulator for sys bio• Some preliminary results
System Biology & Gillespie’s algorithm
• Traditionally studied with continuous ordinary differential equations (ODE)
• bulk reactions, i.e. average behaviour
• Gillespie algorithm: discrete and stochastic simulation of a system via explicit simulation of each reaction
• Gillespie realization represents a random walk that exactly represents the distribution of the master equation (i.e. ODEs)
• under some hypothesis
Gillespie’s algorithm [77]1. Initialization: Initialize the number of molecules in the
system, reactions constants, and random number generators.
2. Monte Carlo step: Generate random numbers to determine the next reaction to occur as well as the time interval. The probability of a given reaction to be chosen is proportional to the number of substrate molecules.
3. Update: Increase the time step by the randomly generated time in Step 2. Update the molecule count based on the reaction that occurred.
4. Iterate: goto Step 2 unless the number of reactants is zero or the simulation time has been exceeded.
Increasingly popular approach
• Sometime more informative than (ODE)• multi-stability, divergent or rare behaviours,
peaks, ...• multi-scale systems
• e.g. deriving macro behaviour from micro
• More computational demanding• much more, especially in motivating cases
Increasingly popular approach• Bio-PEPA [Hillston, Ciocchetta]• SPiM [Cardelli, Phillips]• Stochastic Pi [Priami]• Stochkit [Petzold]• Spatial Pi [Uhrmacher]• Calculus of Wrapped Components [our own]
• kinetics: mass-action, Michaelis–Menten, Hill ...
• ...
http://mc-fastflow.sourceforge.net
BioBITs
IntroductionSCWC
Conclusions
The CalculusThe SimulatorA Case Study
The Calculus of Wrapped Compartments (CWC)
A term is intended to represent a biological system. A term isbuilt by means of the compartment constructor, (�⇤�), from aset E of atomic elements, ranged over by a, b, c , d . A simpleterm is defined as:
t ::= a�� (a ⇤ t)
We write t to denote a (possibly empty) multiset of simple termst1 . . . tn. Similarly, with a we denote a (possibly empty) multiset ofatoms.
M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped Compartments
IntroductionSCWC
Conclusions
The CalculusThe SimulatorA Case Study
Examples of SCWC terms
(i) represents (a b c ⇥ •);(ii) represents (a b c ⇥ (d e ⇥ •));(iii) represents (a b c ⇥ (d e ⇥ •) f g).
M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped CompartmentsIntroduction
SCWCConclusions
The CalculusThe SimulatorA Case Study
Dynamics of SCWC
Rewrite rules are defined as pairs of terms, in which the left termcharacterizes the portion of the system in which the eventmodelled by the rule can occur, and the right one describes howthat portion of the system is changed by the event.
Biomolecular Event Examples of CWC Rewrite Rules
State change a ⇥� bComplexation a b ⇥� cCatalyzed a (b x ⇤ y) ⇥� (b x ⇤ a y)membrane crossing (b x ⇤ a y) ⇥� a (b x ⇤ y)
M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped Compartments
IntroductionSCWC
Conclusions
The CalculusThe SimulatorA Case Study
Stochastic Rules
Rules are decorated with a rate (speed of the reaction).
A Stochastic Rewrite Rule, R, is denoted by Pk⇤⇥ P 0.
The stochastic semantics is given by transitions between termslabeled with the rule applied, R, and a transition rate dependingon the rate of rule R:
tR,k⇥p�⇥ t 0
where R is Pk⇤⇥ P 0, and p is the number of di�erent ways in which
the pattern P may match t (t = C [P�]) and such thatt 0 = C [P 0�] for some context C and variable instantiation �.
M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped Compartments
Ex: HIV and immune response(progression to AIDS)
U
V4
Z5T
Z4
V5
F
I4
λδuf
δut
δuz
δtf
I5βX4
βR5
γR5
γX4
σ2
σ2
δiz
δiz
σ3
π
π
μ
V virus, all phenotypes (V4,V5)
the spike suggests the mutation V5 →V4 (more aggressive)
immune response Z=Z4+Z5 remain stable (but for the peak).
now Z4 decrease and Z5 increase
i.e. HIV is turning intoAIDS more rapidly
M. Aldinucci, A. Bracciali, P. Liò, A. Sorathiya, and M. Torquati. StochKit-FF: Efficient systems biology on multicore architectures. In High Performance Bioinformatics and Biomedicine (HiBB), vol. 6586 of LNCS, 2011. Springer.
• Peaks are informative events,• virus mutation triggers AIDS progression• hardly detected with ODEs
• high resolution required to detect spikes,• each trajectory can be over 6G Bytes of data
• and thousands of trajectories are needed• compute everything, save everything, move and
join all data, analyse all data, then get first results• often to discover parameters are wrong ...
• It is Monte Carlo,• well understood• easy to parallelise on different trajectories
• it is Monte Carlo w Markov Chains models (CTMC)• single trajectory: no parallelisation without relaxation• compute time ≠ simulation time• compute time for different trajectories heavily unbalanced
• fast reactions and slow reactions, some not interesting (e.g. water-steam-water)
Unbalancing + filtering
• few trajectories (e.g. the interesting ones) can significantly delay the completion of others
• over-provisioning don’t help that much• simulated time moves at different pace w.r.t. wall-
clock time • data joining from different trajectories should be
aligned at the same simulation timereducereduce reducereduce
a1 a2 a3 a4 b1 b2 b3 b4
c1 c2 c3 c4 d1 d2 d3 d4
e1 e2 e3 e4 f1 f2 f3 f4
{a,b}@P1
{c,d}@P2
{e,f}@P3
a1 b1
c1
e1
d1
f1
b2
c2
e2 f2
b3
a2
d2
e3 f3
c3
b4
d3
a3 e4
f4c4 d4
a4
{a,b}@P1
{c,d}@P2
{e,f}@P3
idle
F(a,...,f)
F(a,...,f)
idle
a1 b1
c1
e1
d1
f1
b2
c2
e2 f2
b3
a2
d2
e3 f3
c3
b4
d3
a3 e4
f4c4 d4
a4
{a,b}@P1
{c,d}@P2
{e,f}@P3
F2(a,...f)
simulationwall-clocktime
trajectory reduction
gainF1(a,...f)
F3(a,...f)
F4(a,...f)
ii)
iii)
i)
1's runs 2's runs 4's 3's
• It is Monte Carlo,• well understood• easy to parallelise on different trajectories
• it is Monte Carlo w Markov Chains models (CTMC)• single trajectory: no parallelisation without relaxation• compute time ≠ simulation time• compute time for different trajectories heavily unbalanced
• fast reactions and slow reactions, some not interesting (e.g. water-steam-water)
• it is Monte Carlo AND data analysis• data is big, analysis can be very expensive and it typically starts
after the simulation• the whole workflow is perceived too “slow” by bio-scientists to
be really useful
CollectorTraiectories
Post-processinge.g. parsing
Disk
From Distributed to Multicore and back
Rebalance
In BioSims the whole trajectory is needed, easily
GB of data
The whole dataset, TB of data, should be re-read to extract statistical estimators
SchedulerParameter
Sweep
Model description
Pre-processinge.g. parsing
Worker nrun simulations
Worker 1run simulations
one or moresimulationinstance
one or moresimulationinstance
Disk 1
Disk n
From Distributed to Multicore and back
RebalanceSchedulerParameter
Sweep
Model description
Pre-processinge.g. parsing
Worker nrun simulations
Worker 1run simulations
one or moresimulationinstance
one or moresimulationinstance
Disk 1Collector
TraiectoriesPost-processing
e.g. parsing
Now the issue become a real problem Bottlenecks: disk and memoryPost-processing: not pipelined and often not parallel at all
From Distributed to Multicore and back
• Multi Carlo sims for Bio are I/O-bound• Sampling reduce I/O traffic but worsen precision and
analysis of “strange” dynamics (spikes, diversion from average, etc.), which observation motivates stochastic analysis (ODEs)
• Data analysis is also I/O-bound• if approached is a “post-processing” fashion, data should be
retrieved from the disks
• The porting of distributed solution “as is” on multicore is going insist on weakness of multicore architectures
• Memory wall, I/O, disk• SIMD/GPGPUs do not change the analysis substantially
From Distributed to Multicore and back
• The same arguments holds on distributed, grids, and clouds as soon as the workflow is considered as a whole
• simulation, data collection and merging, analysis
• Rationale• Manage data as stream, compute online
• May require more computing and less bandwidth• Computation should be designed to be pipelined
• Establish fast data paths across cores/hosts
• Avoid low-level concurrency management• Portability, performance, portability of performance, maintenance, porting
from sequential
Quotes from P. Beckman EuroPar 11 “exascale” keynote
• coarse grain concurrency is nearly exhausted
• it is not about Flops, it is about data movement
• programming systems should be designed to support fast data movement and enforce locality
• shared-memory & inter-socket messaging
• we need a programming model
• a computer language is not a computing modela library is not a computing model
• we need a efficient and compositional run-time
18
Sharedmemory
Macro Data Flow
Beowulf
Grid
MPP
Autonomic
19
P3L1991
SKiE1997
OCamlP3L1998
SKElib2000
Lithium2002
Muskel2006
FastFlow2009
Eskimo2003
ASSIST2001
ASSISTant2008
GCM2008
GPG
PUs
Clouds, clusters of multi-cores and many-cores
Patterns/skeletons & streams
it is not about Flops, it is about data movement, we need compositional run-time
• Streams• focus on data movements at the prog model level• clear semantics• support compositionally and also locality
• the latter is a bit more counter-intuitive
• High-level programming• e.g. patterns
• Patterns + streams• can be implemented efficiently on both multi-core,
distributed, and both
FastFlowhttp://mc-fastflow.sourceforge.net/
FastFlow (multicore)
Streaming network patternsSkeletons: pipeline, map farm, reduce, D&C, ...
Arbitrary streaming networksLock-free SPSC/MPMC queues + FF nodes
Simple streaming networksLock-free SPSC queues + threading model
FastFlow
Multicore and manycoreSMP: cc-UMA & cc-NUMA
Applications on multicore, many-coreEfficient and portable - designed with high-level patterns
Layer 1: Simple streaming networks4 sockets x 8 core x 2 contexts
Xeon E7-4820 @2.0GHz Sandy Bridge 18MB L3 shared cache, 256K L2
0
5
10
15
20
25
30
35
40
45
64 128 256 512 1k 2k 4k 8k
nanose
conds
buffer size
P and C on different cores same CPU
1.50 1.37 0.83
1.330.75 0.82 0.50 0.46
0
5
10
15
20
25
30
35
40
45
64 128 256 512 1k 2k 4k 8k
nanoseco
nd
s
buffer size
P and C on different CPUs
3.333.15
2.23
2.91
3.56
4.08
2.43
2.76
0
5
10
15
20
25
30
35
40
45
64 128 256 512 1k 2k 4k 8k
nanose
conds
buffer size
P and C on same core distinct contexts
0.19 0.14 0.12 0.11 0.09 0.11 0.11 0.11
different sockets
MPI is ~190 ns at best(D.K. Panda)
same socket same core different contexts same socket different cores different
Layer 1: Simple streaming networks
DRAFT
1 int size = N; //SPSC size
2 bool push(void∗ data) {3 if (buf w�>full()) {4 SPSC∗ t = pool.next w();
5 if (! t) return false;
6 buf w = t;
7 }8 buf w�>push(data);
9 return true;
10 }11 bool pop(void∗∗ data) {12 if (buf r�>empty()) {13 if (buf r == buf w) return false;
14 if (buf r�>empty()) {15 SPSC∗ tmp = pool.next r();
16 if (tmp) {17 pool. release (buf r) ;
18 buf r = tmp;
19 }20 }21 }22 return buf r�>pop(data);
23 }
25 struct Pool {26 dSPSC inuse;
27 SPSC cache;
29 SPSC∗ next w() {30 SPSC∗ buf;
31 if (!cache.pop(&buf))
32 buf = allocateSPSC(size);
33 inuse.push(buf);
34 return buf;
35 }36 SPSC∗ next r() {37 SPSC∗ buf;
38 return (inuse.pop(&buf)? buf : NULL);
39 }40 void release(SPSC∗ buf) {41 buf�>reset(); // reset pread and pwrite
42 if (!cache.push(buf))
43 deallocateSPSC(buf);
44 }45 }
Fig. 3: Unbounded wait-free uSPSC queue implementation.
impossible since, if the consumer switches to the next bu↵er while the previousone is not really empty, a data loss will occur. In the next section we prove thatthe if condition at line §3.�� is su�cient to ensure correct execution.
Theorem 2 (uSPSC). The uSPSC unbound queue sketched in Fig. 3 is correct(and wait-free) under any memory consistency model provided that it is built withinternal SPSC queues with size > 1.
Proof. The SPSC queue used as basic building block of the uSPSC queue hasbeen proved correct in [18,13]. Both the producer and the consumer initiallywork on the same bu↵er. The correct execution of pop and push is guaranteedby the correctness of the bu↵er (i.e. the internal SPSC queue) up to the momentthe bu↵er becomes full and the producer starts writing to a new bu↵er. We havetwo cases: the bu↵er the consumer is reading from is either non-empty or empty.In the former case, the correctness is again ensured by bu↵er correctness. Thelatter case is more subtle.
In a relaxed memory consistency model the consumer can be aware that theproducer has changed the writing bu↵er only with the writing of the data of theprevious push because of the Write Memory Barrier (WMB) at line §1.�. In fact,without the WMB, the new value of buf w (line §3.�) and the value to writtento the bu↵er (buf[pwrite] in the previous push at line §1.�) might appear inmemory in any order. Thus — in principle — it can be possible that the readingbu↵er buf r is still perceived as empty and a new writing bu↵er has been alreadystarted (buf r 6= buf w); the condition at line §3.�� could therefore evaluate totrue even if the previous bu↵er is not actually empty. This condition mightlead to data loss because the consumer might overtake and abandon a bu↵erstill holding a valid value. In the uSPSC implementation this cannot happen
M. Aldinucci, S. Campa, M. Danelutto, M. Torquati. An Efficient Synchronisation Mechanism for Multi-Core Systems.
EuroPar 2012.Wed 29 Aug - B3 multicore 14.30-16.00
tail head
dynamiclinked listof circular
buffers
P C
Layer 1: Simple streaming networks http://www.1024cores.net/home/technologies/fastflow
S2
S3 S4
S1 Sn
...
25
50K
100K
150K
200K
1 8 16 24 32 40 48 56 64
Thro
ugh
put (m
sg/s
)
n. threads
uSPSCdSPSC
dSPSC no cache
4 sockets x 8 core x 2 contexts
Xeon E7-4820 @2.0GHz Sandy Bridge 18MB L3 shared cache, 256K L2
Linked list w/ dyn allocMichael-Scott queue
well-known ~ 400 citations 0
50
100
150
200
250
300
350
400
450
64 1024 8192
nanose
conds
buffer size
mapping 1mapping 2mapping 3
Linked list with poolingOpt Michael-Scott queue
(our result)
20x faster
0
5
10
15
20
25
30
35
40
45
64 1024 8192
nanose
cond
s
buffer size
mapping 1mapping 2mapping 3
Linked list + circular bufferFastFlow queue
(our result)
12x faster
same coresame socketdifferent sockets
Speedup
<35 ns irrespectivelyof the mapping
Layer 3: streaming networks patterns
• Composition via C++ template meta-programming
• CPU: Graph composition
• GPU: CUDA streams
• CPU+GPU: offloading
• farm{ pipe }
• pipe(farm, farm)
• pipe(map, reduce)
• ....26
copy_D2H
kernel
copy_H2D
farm_start
copy_D2H
kernel
copy_H2D
copy_D2H
kernel
copy_H2D
farm_start
• farm• on CPU - master-worker - parallelism exploitation• on GPU - CUDA streams - automatic exploitation of asynch
comm
• pipeline• on CPU - pipeline• on GPU - sequence of kernel calls or global mem synch
• map• on CPU - master-worker - parallelism exploitation• on GPU - CUDA SIMT - parallelism exploitation
• reduce• on CPU - master-worker - parallelism exploitation• on GPU - CUDA SIMT (reduction tree) - parallelism exploitation
• D&C• on CPU - master-worker with feedback - // exploitation• on GPU - working on it, maybe loop+farm
Layer 3: streaming networks patterns
stream[k]
copy_D2H
kernel
copy_H2D
stream[0]
farm_start
stream[1] stream[n]
farm_start
+ distributed
• Generic ff_node is subclassed to ff_dnode• ff_dnode can support network channels
• P2P or collective• used as frontier node of streaming graph• can be used to merge graphs across distributed platforms
• No changes to programming model• at least require to “add” stub ff_dnode• when passing pointers data is serialised
• serialisation hand-managed (zero-copy, think to Java!)
M. Aldinucci, S. Campa, M. Danelutto, M. Torquati, P. Kilpatrick. Targeting distributed systems in FastFlow. Should be next talk!
Streaming network patternsSkeletons: pipeline, map farm, reduce, D&C, ...
Arbitrary streaming networksLock-free SPSC/MPMC queues + FF nodes
Simple streaming networksLock-free SPSC queues + threading model
FastFlow
Multicore and manycoreSMP: cc-UMA & cc-NUMA
Applications on multicore, many core & distributed platforms of multicoresEfficient and portable - designed with high-level patterns
Distributed platformsClouds, clusters of SMPs
Simple streaming networksZero copy networking + processes model
Arbitrary streaming networksCollective communications + FF Dnodes
simeng
gath
er
simeng
disp
tach
incomplete simulation tasks (with load balancing)
farm
alig
nmen
t of
traje
ctor
ies
gene
ratio
n of
si
mul
atio
n ta
sks
simulation pipeline
start new simulations, steer and terminate running simulations
stateng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipelinemain pipeline
reducereduce reducereduce
a1 a2 a3 a4 b1 b2 b3 b4
c1 c2 c3 c4 d1 d2 d3 d4
e1 e2 e3 e4 f1 f2 f3 f4
{a,b}@P1
{c,d}@P2
{e,f}@P3
a1 b1
c1
e1
d1
f1
b2
c2
e2 f2
b3
a2
d2
e3 f3
c3
b4
d3
a3 e4
f4c4 d4
a4
{a,b}@P1
{c,d}@P2
{e,f}@P3
idle
F(a,...,f)
F(a,...,f)
idle
a1 b1
c1
e1
d1
f1
b2
c2
e2 f2
b3
a2
d2
e3 f3
c3
b4
d3
a3 e4
f4c4 d4
a4
{a,b}@P1
{c,d}@P2
{e,f}@P3
F2(a,...f)
simulationwall-clocktime
trajectory reduction
gainF1(a,...f)
F3(a,...f)
F4(a,...f)
ii)
iii)
i)
1's runs 2's runs 4's 3's
M. Aldinucci, M. Coppo, F. Damiani, M. Drocco, M. Torquati, A. Troina. On designing multicore-aware simulators for biological systems. PDP 2011. 2011. IEEE.
time
trajectories Trajectory 1
Trajectory n
...
simeng
gath
er
simeng
disp
tach
incomplete simulation tasks (with load balancing)
farm
alig
nmen
t of
traje
ctor
ies
gene
ratio
n of
si
mul
atio
n ta
sks
simulation pipeline
start new simulations, steer and terminate running simulations
stateng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipelinemain pipeline
time
trajectories
simulation-time aligned trajectories
simeng
gath
er
simeng
disp
tach
incomplete simulation tasks (with load balancing)
farm
alig
nmen
t of
traje
ctor
ies
gene
ratio
n of
si
mul
atio
n ta
sks
simulation pipeline
start new simulations, steer and terminate running simulations
stateng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipelinemain pipeline
time
trajectories
dispatch
task
task
task
simeng
gath
er
simeng
disp
tach
incomplete simulation tasks (with load balancing)
farm
alig
nmen
t of
traje
ctor
ies
gene
ratio
n of
si
mul
atio
n ta
sks
simulation pipeline
start new simulations, steer and terminate running simulations
stateng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipelinemain pipeline
simeng
gather
simeng
disp
tach
Scat
ter c
onsu
mer
From
Any
prod
ucer
mean variance
quantiles
QT clusters
k-means
M. Aldinucci, M. Coppo, F. Damiani, M. Drocco, E. Sciacca, S. Spinella, M. Torquati, A. Troina. On parallelizing on-line statistics for stochastic biological simulations. In Euro-Par 2011 Workshops, HiBB, vol 7155 of LNCS.
simeng
gath
er
simeng
disp
tach
incomplete simulation tasks (with load balancing)
farm
alig
nmen
t of
traje
ctor
ies
gene
ratio
n of
si
mul
atio
n ta
sks
simulation pipeline
start new simulations, steer and terminate running simulations
stateng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipelinemain pipeline
simeng
gather
simeng
disp
tach
Scat
ter c
onsu
mer
From
Any
prod
ucersim
eng
gath
er
simeng
disp
tach
sim farm
gene
ratio
n of
si
mul
atio
n ta
sks
(sca
tter p
rodu
cer)
Scat
ter c
onsu
mer
From
Any
prod
ucer
simulation pipeline (host 1)
simeng
gath
er
simeng
disp
tach
sim farm
Scat
ter c
onsu
mer
From
Any
prod
ucer
simulation pipeline (host n) alig
nmen
t of
traje
ctor
ies
(sca
tter c
ons)
farm of simulation pipelines
(distributed)
main pipeline (host 0)
stateng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipelinemain pipeline
gene
ratio
n of
si
mul
atio
n ta
sks
(sca
tter p
rodu
cer)
Scat
ter c
onsu
mer
Scat
ter c
onsu
mer
From
Any
prod
ucer
From
Any
prod
ucer
alig
nmen
t of
traje
ctor
ies
(sca
tter c
ons)
simeng
gath
er
simeng
disp
tach
sim farm
gene
ratio
n of
si
mul
atio
n ta
sks
(sca
tter p
rodu
cer) stat
eng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipeline
Scat
ter c
onsu
mer
From
Any
prod
ucer
simulation pipeline (host 1)
simeng
gath
er
simeng
disp
tach
sim farm
Scat
ter c
onsu
mer
From
Any
prod
ucer
simulation pipeline (host n) alig
nmen
t of
traje
ctor
ies
(sca
tter c
ons)
farm of simulation pipelines
(distributed)
start new simulations, steer and terminate running simulations
main pipeline (host 0)
serialisation, coalescingallocation (zero-copy 0MQ), de-
serialisation
simeng
gath
er
simeng
disp
tach
incomplete simulation tasks (with load balancing)
farm
alig
nmen
t of
traje
ctor
ies
gene
ratio
n of
si
mul
atio
n ta
sks
simulation pipeline
start new simulations, steer and terminate running simulations
stateng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
analysis pipelinemain pipeline
simeng
gath
er
simeng
disp
tach
incomplete simulation tasks (with load balancing)
farm
alig
nmen
t of
traje
ctor
ies
gene
ratio
n of
si
mul
atio
n ta
sks stat
eng
gath
er
disp
tach
farm
gene
ratio
n of
sl
idin
g w
indo
ws
of tr
ajec
torie
sraw simulation
results display ofresults
---Graphical
UserInterface
meanvariance k-means
filtered simulation
results
stateng
simulation pipeline analysis pipeline
start new simulations, steer and terminate running simulations
main pipeline
Performance(preliminary)
0
5
10
15
20
25
30
35
5 10 15 20 25 30
spee
dup
n. sim. workers
ideal128 trajectories512 trajectories
1024 trajectories 0
5
10
15
20
25
30
35
5 10 15 20 25 30
spee
dup
n. sim. workers
ideal128 trajectories512 trajectories
1024 trajectories
IntelNehalem32-core
1 stat engine 4 stat engines
0
5
10
15
20
25
0 5 10 15 20 25
spee
dup
aggreated n. of cores
ideal2 cores per host4 cores per host
1
2
3
4
5
6
7
8
1 2 3 4 5 6 7 8
spee
dup
n.of hosts
ideal2 cores per host4 cores per host
Cluster 8xIntel Xeon 6-coreInfinband (IPoIB)
courtesy of Mellanox
4 stat engines 1024 trajectories4 stat engines 1024 trajectories
Involved data
• Simple examples (neurospora, ...)• 2-4 double * n. of variables * n. of samples x n.
of trajectories * cases in sensitivity analysis• e.g.4*8*4*1M*1k*8 ~ 1 TBytes
• HIV 6GB x 1024 trajectories ~ 6TB• The more observed variables, precision,
cases for sensitivity analysis the more data
• CMS: Compact Muon Solenoid at CERN• 3500 scientists, 180 Universities and Research Labs (40 countries)• CMS is like a ~75 MegaPixels Digital Camera. 40M “photos”/s
Selection of 300 ‘photos’/s ~450 MB/s from the detector are ~PBs of data/year
• CERN has (of course) its well established data flow and infrastructure, however ...
IMPACT Innovative Methods for Particle Colliders at the Terascale(2012-2015)
IMPACT Innovative Methods for Particle Colliders at the Terascale(2012-2015, oversimplified vision)
huge data Diff(somehow)
Monte Carlo simulationof “known” traces
Filtered datato be analysed
(much smaller)
huge data
Particle physicists Theoretic physicists
new knowledge
Monte Carlo simulationssmaller but frequent
requires great “reactivity”of the workflow to tune models
new speculations
Preliminary results onchannel H→ZZ→4l
shown at Higg’s boson claim Jul 2012
(Amapane at al.)
• •
• o o
o
! !
Direct competition: unstable switch
CWC syntax
Courtesy of Luca Cardelli
On Switches and Oscillators Program
Equivalence in Biology?
http://lucacardelli.name
... after a number of transformations: a stable switch faithfully modelling cell switch
CWC syntax
Courtesy of Luca Cardelli
On Switches and Oscillators Program
Equivalence in Biology?
http://lucacardelli.name
50
100
150
200
250
300
350
400
450
500
550
600
0 5 10 15 20
K-m
eans
clu
ster
s on
mol
ecul
es (a
)
time
0
100
200
300
400
500
600
700
0 5 10 15 20
num
ber o
f "a"
mol
ecul
es
time
raw simulationsstandard deviation
mean
Schlögl modelautocatalytic, trimolecular reaction scheme (bistable)
Bacteriophage λ life cycleintegration of a strand of DNA in the molecule of E. coli DNA(multi-stable)
0
10
20
30
40
50
60
70
0 20 40 60 80 100 120
# M
olec
ules
(CI)
time
5
10
15
20
25
30
35
40
45
50
0 20 40 60 80 100 120
QT
clu
ste
rs w
ith t
ren
ds
(CI)
time
Transcriptional regulation in Neurospora(circadian clock period detection)
0.026
0.028
0.03
0.032
0.034
0.036
0.038
0.04
dark light dark light
Peak
freq
uenc
y
Light condition phases
0
200
400
600
800
1000
Ligh
t Con
ditio
n
0
200
400
600
800
1000
0 50 100 150 200
Dar
k C
ondi
tion
Time in hours
Conclusions• Talk focused on programming model
• Many important“in a cloud” ignored in the talk: middleware, faults, ...
• Data movement & high-level are key features• helps the mapping of features to platforms and performance portability• ease the design• MapReduce is an instance, should be not the only one
• Formal biology at embryonal stage• similar data & computation problems in analysis “in
vitro” experiments• increasing interest from industrial and “core” bio
scientist for parallel computing
• FastFlow• Massimo Torquati (Pisa), Marco Danelutto (Pisa), Peter Kilpatrick
(Belfast), Massimiliano Meneghin (IBM Research Dublin)
• Case studies, biological models• Luca Cardelli (Microsoft), Pietro Liò (Cambridge), Andrea
Bracciali (Stirling), Cristina Calcagno (Torino)
• CWC simulation implementation• Maurizio Drocco (Torino), Fabio Tordini (Torino)
• CWC language design• Maurizio Drocco, Eva Sciacca, Salvatore Spinella, Mario Coppo,
Angelo Troina, Ferruccio Damiani (Torino)
• Graphical User Interface• ETICA company
• Infiniband cluster and computing facilities• Mellanox, HPC Advisory Council
Acknowledgements
IMPACT
BioBITs