51
Turning Big data into knowledge Techniques and Tools for Parallel Computing on Online Data Streams in Systems Biology and Epidemiology IMPACT BioBITs Marco Aldinucci University of Torino, Italy

Turning Big data into knowledgecalvados.di.unipi.it/storage/talks/2012_BDMC_Europar_Invited.pdf · StochKit-FF: Efficient systems biology on multicore architectures. ... e1 e2 e3

Embed Size (px)

Citation preview

Turning Big data into knowledgeTechniques and Tools for Parallel Computing on Online Data Streams in Systems Biology and Epidemiology

IMPACT BioBITs

Marco AldinucciUniversity of Torino, Italy

Outline

• Stochastic Formal System Biology• From Distributed to Multicore and back• On programming models

• FastFlow

• The CWC parallel simulator for sys bio• Some preliminary results

Formal synthetic system biology

?

System Biology & Gillespie’s algorithm

• Traditionally studied with continuous ordinary differential equations (ODE)

• bulk reactions, i.e. average behaviour

• Gillespie algorithm: discrete and stochastic simulation of a system via explicit simulation of each reaction

• Gillespie realization represents a random walk that exactly represents the distribution of the master equation (i.e. ODEs)

• under some hypothesis

Gillespie’s algorithm [77]1. Initialization: Initialize the number of molecules in the

system, reactions constants, and random number generators.

2. Monte Carlo step: Generate random numbers to determine the next reaction to occur as well as the time interval. The probability of a given reaction to be chosen is proportional to the number of substrate molecules.

3. Update: Increase the time step by the randomly generated time in Step 2. Update the molecule count based on the reaction that occurred.

4. Iterate: goto Step 2 unless the number of reactants is zero or the simulation time has been exceeded.

Increasingly popular approach

• Sometime more informative than (ODE)• multi-stability, divergent or rare behaviours,

peaks, ...• multi-scale systems

• e.g. deriving macro behaviour from micro

• More computational demanding• much more, especially in motivating cases

Increasingly popular approach• Bio-PEPA [Hillston, Ciocchetta]• SPiM [Cardelli, Phillips]• Stochastic Pi [Priami]• Stochkit [Petzold]• Spatial Pi [Uhrmacher]• Calculus of Wrapped Components [our own]

• kinetics: mass-action, Michaelis–Menten, Hill ...

• ...

http://mc-fastflow.sourceforge.net

BioBITs

IntroductionSCWC

Conclusions

The CalculusThe SimulatorA Case Study

The Calculus of Wrapped Compartments (CWC)

A term is intended to represent a biological system. A term isbuilt by means of the compartment constructor, (�⇤�), from aset E of atomic elements, ranged over by a, b, c , d . A simpleterm is defined as:

t ::= a�� (a ⇤ t)

We write t to denote a (possibly empty) multiset of simple termst1 . . . tn. Similarly, with a we denote a (possibly empty) multiset ofatoms.

M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped Compartments

IntroductionSCWC

Conclusions

The CalculusThe SimulatorA Case Study

Examples of SCWC terms

(i) represents (a b c ⇥ •);(ii) represents (a b c ⇥ (d e ⇥ •));(iii) represents (a b c ⇥ (d e ⇥ •) f g).

M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped CompartmentsIntroduction

SCWCConclusions

The CalculusThe SimulatorA Case Study

Dynamics of SCWC

Rewrite rules are defined as pairs of terms, in which the left termcharacterizes the portion of the system in which the eventmodelled by the rule can occur, and the right one describes howthat portion of the system is changed by the event.

Biomolecular Event Examples of CWC Rewrite Rules

State change a ⇥� bComplexation a b ⇥� cCatalyzed a (b x ⇤ y) ⇥� (b x ⇤ a y)membrane crossing (b x ⇤ a y) ⇥� a (b x ⇤ y)

M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped Compartments

IntroductionSCWC

Conclusions

The CalculusThe SimulatorA Case Study

Stochastic Rules

Rules are decorated with a rate (speed of the reaction).

A Stochastic Rewrite Rule, R, is denoted by Pk⇤⇥ P 0.

The stochastic semantics is given by transitions between termslabeled with the rule applied, R, and a transition rate dependingon the rate of rule R:

tR,k⇥p�⇥ t 0

where R is Pk⇤⇥ P 0, and p is the number of di�erent ways in which

the pattern P may match t (t = C [P�]) and such thatt 0 = C [P 0�] for some context C and variable instantiation �.

M. Coppo, F. Damiani, M. Drocco, E. Grassi, A. Troina Stochastic Calculus of Wrapped Compartments

Ex: HIV and immune response(progression to AIDS)

U

V4

Z5T

Z4

V5

F

I4

λδuf

δut

δuz

δtf

I5βX4

βR5

γR5

γX4

σ2

σ2

δiz

δiz

σ3

π

π

μ

V virus, all phenotypes (V4,V5)

the spike suggests the mutation V5 →V4 (more aggressive)

immune response Z=Z4+Z5 remain stable (but for the peak).

now Z4 decrease and Z5 increase

i.e. HIV is turning intoAIDS more rapidly

M. Aldinucci, A. Bracciali, P. Liò, A. Sorathiya, and M. Torquati. StochKit-FF: Efficient systems biology on multicore architectures. In High Performance Bioinformatics and Biomedicine (HiBB), vol. 6586 of LNCS, 2011. Springer.

• Peaks are informative events,• virus mutation triggers AIDS progression• hardly detected with ODEs

• high resolution required to detect spikes,• each trajectory can be over 6G Bytes of data

• and thousands of trajectories are needed• compute everything, save everything, move and

join all data, analyse all data, then get first results• often to discover parameters are wrong ...

• It is Monte Carlo,• well understood• easy to parallelise on different trajectories

• it is Monte Carlo w Markov Chains models (CTMC)• single trajectory: no parallelisation without relaxation• compute time ≠ simulation time• compute time for different trajectories heavily unbalanced

• fast reactions and slow reactions, some not interesting (e.g. water-steam-water)

Unbalancing + filtering

• few trajectories (e.g. the interesting ones) can significantly delay the completion of others

• over-provisioning don’t help that much• simulated time moves at different pace w.r.t. wall-

clock time • data joining from different trajectories should be

aligned at the same simulation timereducereduce reducereduce

a1 a2 a3 a4 b1 b2 b3 b4

c1 c2 c3 c4 d1 d2 d3 d4

e1 e2 e3 e4 f1 f2 f3 f4

{a,b}@P1

{c,d}@P2

{e,f}@P3

a1 b1

c1

e1

d1

f1

b2

c2

e2 f2

b3

a2

d2

e3 f3

c3

b4

d3

a3 e4

f4c4 d4

a4

{a,b}@P1

{c,d}@P2

{e,f}@P3

idle

F(a,...,f)

F(a,...,f)

idle

a1 b1

c1

e1

d1

f1

b2

c2

e2 f2

b3

a2

d2

e3 f3

c3

b4

d3

a3 e4

f4c4 d4

a4

{a,b}@P1

{c,d}@P2

{e,f}@P3

F2(a,...f)

simulationwall-clocktime

trajectory reduction

gainF1(a,...f)

F3(a,...f)

F4(a,...f)

ii)

iii)

i)

1's runs 2's runs 4's 3's

• It is Monte Carlo,• well understood• easy to parallelise on different trajectories

• it is Monte Carlo w Markov Chains models (CTMC)• single trajectory: no parallelisation without relaxation• compute time ≠ simulation time• compute time for different trajectories heavily unbalanced

• fast reactions and slow reactions, some not interesting (e.g. water-steam-water)

• it is Monte Carlo AND data analysis• data is big, analysis can be very expensive and it typically starts

after the simulation• the whole workflow is perceived too “slow” by bio-scientists to

be really useful

CollectorTraiectories

Post-processinge.g. parsing

Disk

From Distributed to Multicore and back

Rebalance

In BioSims the whole trajectory is needed, easily

GB of data

The whole dataset, TB of data, should be re-read to extract statistical estimators

SchedulerParameter

Sweep

Model description

Pre-processinge.g. parsing

Worker nrun simulations

Worker 1run simulations

one or moresimulationinstance

one or moresimulationinstance

Disk 1

Disk n

From Distributed to Multicore and back

RebalanceSchedulerParameter

Sweep

Model description

Pre-processinge.g. parsing

Worker nrun simulations

Worker 1run simulations

one or moresimulationinstance

one or moresimulationinstance

Disk 1Collector

TraiectoriesPost-processing

e.g. parsing

Now the issue become a real problem Bottlenecks: disk and memoryPost-processing: not pipelined and often not parallel at all

From Distributed to Multicore and back

• Multi Carlo sims for Bio are I/O-bound• Sampling reduce I/O traffic but worsen precision and

analysis of “strange” dynamics (spikes, diversion from average, etc.), which observation motivates stochastic analysis (ODEs)

• Data analysis is also I/O-bound• if approached is a “post-processing” fashion, data should be

retrieved from the disks

• The porting of distributed solution “as is” on multicore is going insist on weakness of multicore architectures

• Memory wall, I/O, disk• SIMD/GPGPUs do not change the analysis substantially

From Distributed to Multicore and back

• The same arguments holds on distributed, grids, and clouds as soon as the workflow is considered as a whole

• simulation, data collection and merging, analysis

• Rationale• Manage data as stream, compute online

• May require more computing and less bandwidth• Computation should be designed to be pipelined

• Establish fast data paths across cores/hosts

• Avoid low-level concurrency management• Portability, performance, portability of performance, maintenance, porting

from sequential

Quotes from P. Beckman EuroPar 11 “exascale” keynote

• coarse grain concurrency is nearly exhausted

• it is not about Flops, it is about data movement

• programming systems should be designed to support fast data movement and enforce locality

• shared-memory & inter-socket messaging

• we need a programming model

• a computer language is not a computing modela library is not a computing model

• we need a efficient and compositional run-time

18

Sharedmemory

Macro Data Flow

Beowulf

Grid

MPP

Autonomic

19

P3L1991

SKiE1997

OCamlP3L1998

SKElib2000

Lithium2002

Muskel2006

FastFlow2009

Eskimo2003

ASSIST2001

ASSISTant2008

GCM2008

GPG

PUs

Clouds, clusters of multi-cores and many-cores

Patterns/skeletons & streams

it is not about Flops, it is about data movement, we need compositional run-time

• Streams• focus on data movements at the prog model level• clear semantics• support compositionally and also locality

• the latter is a bit more counter-intuitive

• High-level programming• e.g. patterns

• Patterns + streams• can be implemented efficiently on both multi-core,

distributed, and both

FastFlowhttp://mc-fastflow.sourceforge.net/

FastFlow (multicore)

Streaming network patternsSkeletons: pipeline, map farm, reduce, D&C, ...

Arbitrary streaming networksLock-free SPSC/MPMC queues + FF nodes

Simple streaming networksLock-free SPSC queues + threading model

FastFlow

Multicore and manycoreSMP: cc-UMA & cc-NUMA

Applications on multicore, many-coreEfficient and portable - designed with high-level patterns

Layer 1: Simple streaming networks4 sockets x 8 core x 2 contexts

Xeon E7-4820 @2.0GHz Sandy Bridge 18MB L3 shared cache, 256K L2

0

5

10

15

20

25

30

35

40

45

64 128 256 512 1k 2k 4k 8k

nanose

conds

buffer size

P and C on different cores same CPU

1.50 1.37 0.83

1.330.75 0.82 0.50 0.46

0

5

10

15

20

25

30

35

40

45

64 128 256 512 1k 2k 4k 8k

nanoseco

nd

s

buffer size

P and C on different CPUs

3.333.15

2.23

2.91

3.56

4.08

2.43

2.76

0

5

10

15

20

25

30

35

40

45

64 128 256 512 1k 2k 4k 8k

nanose

conds

buffer size

P and C on same core distinct contexts

0.19 0.14 0.12 0.11 0.09 0.11 0.11 0.11

different sockets

MPI is ~190 ns at best(D.K. Panda)

same socket same core different contexts same socket different cores different

Layer 1: Simple streaming networks

DRAFT

1 int size = N; //SPSC size

2 bool push(void∗ data) {3 if (buf w�>full()) {4 SPSC∗ t = pool.next w();

5 if (! t) return false;

6 buf w = t;

7 }8 buf w�>push(data);

9 return true;

10 }11 bool pop(void∗∗ data) {12 if (buf r�>empty()) {13 if (buf r == buf w) return false;

14 if (buf r�>empty()) {15 SPSC∗ tmp = pool.next r();

16 if (tmp) {17 pool. release (buf r) ;

18 buf r = tmp;

19 }20 }21 }22 return buf r�>pop(data);

23 }

25 struct Pool {26 dSPSC inuse;

27 SPSC cache;

29 SPSC∗ next w() {30 SPSC∗ buf;

31 if (!cache.pop(&buf))

32 buf = allocateSPSC(size);

33 inuse.push(buf);

34 return buf;

35 }36 SPSC∗ next r() {37 SPSC∗ buf;

38 return (inuse.pop(&buf)? buf : NULL);

39 }40 void release(SPSC∗ buf) {41 buf�>reset(); // reset pread and pwrite

42 if (!cache.push(buf))

43 deallocateSPSC(buf);

44 }45 }

Fig. 3: Unbounded wait-free uSPSC queue implementation.

impossible since, if the consumer switches to the next bu↵er while the previousone is not really empty, a data loss will occur. In the next section we prove thatthe if condition at line §3.�� is su�cient to ensure correct execution.

Theorem 2 (uSPSC). The uSPSC unbound queue sketched in Fig. 3 is correct(and wait-free) under any memory consistency model provided that it is built withinternal SPSC queues with size > 1.

Proof. The SPSC queue used as basic building block of the uSPSC queue hasbeen proved correct in [18,13]. Both the producer and the consumer initiallywork on the same bu↵er. The correct execution of pop and push is guaranteedby the correctness of the bu↵er (i.e. the internal SPSC queue) up to the momentthe bu↵er becomes full and the producer starts writing to a new bu↵er. We havetwo cases: the bu↵er the consumer is reading from is either non-empty or empty.In the former case, the correctness is again ensured by bu↵er correctness. Thelatter case is more subtle.

In a relaxed memory consistency model the consumer can be aware that theproducer has changed the writing bu↵er only with the writing of the data of theprevious push because of the Write Memory Barrier (WMB) at line §1.�. In fact,without the WMB, the new value of buf w (line §3.�) and the value to writtento the bu↵er (buf[pwrite] in the previous push at line §1.�) might appear inmemory in any order. Thus — in principle — it can be possible that the readingbu↵er buf r is still perceived as empty and a new writing bu↵er has been alreadystarted (buf r 6= buf w); the condition at line §3.�� could therefore evaluate totrue even if the previous bu↵er is not actually empty. This condition mightlead to data loss because the consumer might overtake and abandon a bu↵erstill holding a valid value. In the uSPSC implementation this cannot happen

M. Aldinucci, S. Campa, M. Danelutto, M. Torquati. An Efficient Synchronisation Mechanism for Multi-Core Systems.

EuroPar 2012.Wed 29 Aug - B3 multicore 14.30-16.00

tail head

dynamiclinked listof circular

buffers

P C

Layer 1: Simple streaming networks http://www.1024cores.net/home/technologies/fastflow

S2

S3 S4

S1 Sn

...

25

50K

100K

150K

200K

1 8 16 24 32 40 48 56 64

Thro

ugh

put (m

sg/s

)

n. threads

uSPSCdSPSC

dSPSC no cache

4 sockets x 8 core x 2 contexts

Xeon E7-4820 @2.0GHz Sandy Bridge 18MB L3 shared cache, 256K L2

Linked list w/ dyn allocMichael-Scott queue

well-known ~ 400 citations 0

50

100

150

200

250

300

350

400

450

64 1024 8192

nanose

conds

buffer size

mapping 1mapping 2mapping 3

Linked list with poolingOpt Michael-Scott queue

(our result)

20x faster

0

5

10

15

20

25

30

35

40

45

64 1024 8192

nanose

cond

s

buffer size

mapping 1mapping 2mapping 3

Linked list + circular bufferFastFlow queue

(our result)

12x faster

same coresame socketdifferent sockets

Speedup

<35 ns irrespectivelyof the mapping

Layer 3: streaming networks patterns

• Composition via C++ template meta-programming

• CPU: Graph composition

• GPU: CUDA streams

• CPU+GPU: offloading

• farm{ pipe }

• pipe(farm, farm)

• pipe(map, reduce)

• ....26

copy_D2H

kernel

copy_H2D

farm_start

copy_D2H

kernel

copy_H2D

copy_D2H

kernel

copy_H2D

farm_start

• farm• on CPU - master-worker - parallelism exploitation• on GPU - CUDA streams - automatic exploitation of asynch

comm

• pipeline• on CPU - pipeline• on GPU - sequence of kernel calls or global mem synch

• map• on CPU - master-worker - parallelism exploitation• on GPU - CUDA SIMT - parallelism exploitation

• reduce• on CPU - master-worker - parallelism exploitation• on GPU - CUDA SIMT (reduction tree) - parallelism exploitation

• D&C• on CPU - master-worker with feedback - // exploitation• on GPU - working on it, maybe loop+farm

Layer 3: streaming networks patterns

stream[k]

copy_D2H

kernel

copy_H2D

stream[0]

farm_start

stream[1] stream[n]

farm_start

Layer 3: streaming networks patterns(easy to port)

+ distributed

• Generic ff_node is subclassed to ff_dnode• ff_dnode can support network channels

• P2P or collective• used as frontier node of streaming graph• can be used to merge graphs across distributed platforms

• No changes to programming model• at least require to “add” stub ff_dnode• when passing pointers data is serialised

• serialisation hand-managed (zero-copy, think to Java!)

M. Aldinucci, S. Campa, M. Danelutto, M. Torquati, P. Kilpatrick. Targeting distributed systems in FastFlow. Should be next talk!

Streaming network patternsSkeletons: pipeline, map farm, reduce, D&C, ...

Arbitrary streaming networksLock-free SPSC/MPMC queues + FF nodes

Simple streaming networksLock-free SPSC queues + threading model

FastFlow

Multicore and manycoreSMP: cc-UMA & cc-NUMA

Applications on multicore, many core & distributed platforms of multicoresEfficient and portable - designed with high-level patterns

Distributed platformsClouds, clusters of SMPs

Simple streaming networksZero copy networking + processes model

Arbitrary streaming networksCollective communications + FF Dnodes

CWC simulator example

simeng

gath

er

simeng

disp

tach

incomplete simulation tasks (with load balancing)

farm

alig

nmen

t of

traje

ctor

ies

gene

ratio

n of

si

mul

atio

n ta

sks

simulation pipeline

start new simulations, steer and terminate running simulations

stateng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipelinemain pipeline

reducereduce reducereduce

a1 a2 a3 a4 b1 b2 b3 b4

c1 c2 c3 c4 d1 d2 d3 d4

e1 e2 e3 e4 f1 f2 f3 f4

{a,b}@P1

{c,d}@P2

{e,f}@P3

a1 b1

c1

e1

d1

f1

b2

c2

e2 f2

b3

a2

d2

e3 f3

c3

b4

d3

a3 e4

f4c4 d4

a4

{a,b}@P1

{c,d}@P2

{e,f}@P3

idle

F(a,...,f)

F(a,...,f)

idle

a1 b1

c1

e1

d1

f1

b2

c2

e2 f2

b3

a2

d2

e3 f3

c3

b4

d3

a3 e4

f4c4 d4

a4

{a,b}@P1

{c,d}@P2

{e,f}@P3

F2(a,...f)

simulationwall-clocktime

trajectory reduction

gainF1(a,...f)

F3(a,...f)

F4(a,...f)

ii)

iii)

i)

1's runs 2's runs 4's 3's

M. Aldinucci, M. Coppo, F. Damiani, M. Drocco, M. Torquati, A. Troina. On designing multicore-aware simulators for biological systems. PDP 2011. 2011. IEEE.

time

trajectories Trajectory 1

Trajectory n

...

simeng

gath

er

simeng

disp

tach

incomplete simulation tasks (with load balancing)

farm

alig

nmen

t of

traje

ctor

ies

gene

ratio

n of

si

mul

atio

n ta

sks

simulation pipeline

start new simulations, steer and terminate running simulations

stateng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipelinemain pipeline

time

trajectories

simulation-time aligned trajectories

simeng

gath

er

simeng

disp

tach

incomplete simulation tasks (with load balancing)

farm

alig

nmen

t of

traje

ctor

ies

gene

ratio

n of

si

mul

atio

n ta

sks

simulation pipeline

start new simulations, steer and terminate running simulations

stateng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipelinemain pipeline

time

trajectories

dispatch

task

task

task

simeng

gath

er

simeng

disp

tach

incomplete simulation tasks (with load balancing)

farm

alig

nmen

t of

traje

ctor

ies

gene

ratio

n of

si

mul

atio

n ta

sks

simulation pipeline

start new simulations, steer and terminate running simulations

stateng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipelinemain pipeline

simeng

gather

simeng

disp

tach

Scat

ter c

onsu

mer

From

Any

prod

ucer

mean variance

quantiles

QT clusters

k-means

M. Aldinucci, M. Coppo, F. Damiani, M. Drocco, E. Sciacca, S. Spinella, M. Torquati, A. Troina. On parallelizing on-line statistics for stochastic biological simulations. In Euro-Par 2011 Workshops, HiBB, vol 7155 of LNCS.

simeng

gath

er

simeng

disp

tach

incomplete simulation tasks (with load balancing)

farm

alig

nmen

t of

traje

ctor

ies

gene

ratio

n of

si

mul

atio

n ta

sks

simulation pipeline

start new simulations, steer and terminate running simulations

stateng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipelinemain pipeline

simeng

gather

simeng

disp

tach

Scat

ter c

onsu

mer

From

Any

prod

ucersim

eng

gath

er

simeng

disp

tach

sim farm

gene

ratio

n of

si

mul

atio

n ta

sks

(sca

tter p

rodu

cer)

Scat

ter c

onsu

mer

From

Any

prod

ucer

simulation pipeline (host 1)

simeng

gath

er

simeng

disp

tach

sim farm

Scat

ter c

onsu

mer

From

Any

prod

ucer

simulation pipeline (host n) alig

nmen

t of

traje

ctor

ies

(sca

tter c

ons)

farm of simulation pipelines

(distributed)

main pipeline (host 0)

stateng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipelinemain pipeline

gene

ratio

n of

si

mul

atio

n ta

sks

(sca

tter p

rodu

cer)

Scat

ter c

onsu

mer

Scat

ter c

onsu

mer

From

Any

prod

ucer

From

Any

prod

ucer

alig

nmen

t of

traje

ctor

ies

(sca

tter c

ons)

simeng

gath

er

simeng

disp

tach

sim farm

gene

ratio

n of

si

mul

atio

n ta

sks

(sca

tter p

rodu

cer) stat

eng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipeline

Scat

ter c

onsu

mer

From

Any

prod

ucer

simulation pipeline (host 1)

simeng

gath

er

simeng

disp

tach

sim farm

Scat

ter c

onsu

mer

From

Any

prod

ucer

simulation pipeline (host n) alig

nmen

t of

traje

ctor

ies

(sca

tter c

ons)

farm of simulation pipelines

(distributed)

start new simulations, steer and terminate running simulations

main pipeline (host 0)

serialisation, coalescingallocation (zero-copy 0MQ), de-

serialisation

simeng

gath

er

simeng

disp

tach

incomplete simulation tasks (with load balancing)

farm

alig

nmen

t of

traje

ctor

ies

gene

ratio

n of

si

mul

atio

n ta

sks

simulation pipeline

start new simulations, steer and terminate running simulations

stateng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

analysis pipelinemain pipeline

simeng

gath

er

simeng

disp

tach

incomplete simulation tasks (with load balancing)

farm

alig

nmen

t of

traje

ctor

ies

gene

ratio

n of

si

mul

atio

n ta

sks stat

eng

gath

er

disp

tach

farm

gene

ratio

n of

sl

idin

g w

indo

ws

of tr

ajec

torie

sraw simulation

results display ofresults

---Graphical

UserInterface

meanvariance k-means

filtered simulation

results

stateng

simulation pipeline analysis pipeline

start new simulations, steer and terminate running simulations

main pipeline

Performance(preliminary)

0

5

10

15

20

25

30

35

5 10 15 20 25 30

spee

dup

n. sim. workers

ideal128 trajectories512 trajectories

1024 trajectories 0

5

10

15

20

25

30

35

5 10 15 20 25 30

spee

dup

n. sim. workers

ideal128 trajectories512 trajectories

1024 trajectories

IntelNehalem32-core

1 stat engine 4 stat engines

0

5

10

15

20

25

0 5 10 15 20 25

spee

dup

aggreated n. of cores

ideal2 cores per host4 cores per host

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8

spee

dup

n.of hosts

ideal2 cores per host4 cores per host

Cluster 8xIntel Xeon 6-coreInfinband (IPoIB)

courtesy of Mellanox

4 stat engines 1024 trajectories4 stat engines 1024 trajectories

Involved data

• Simple examples (neurospora, ...)• 2-4 double * n. of variables * n. of samples x n.

of trajectories * cases in sensitivity analysis• e.g.4*8*4*1M*1k*8 ~ 1 TBytes

• HIV 6GB x 1024 trajectories ~ 6TB• The more observed variables, precision,

cases for sensitivity analysis the more data

• CMS: Compact Muon Solenoid at CERN• 3500 scientists, 180 Universities and Research Labs (40 countries)• CMS is like a ~75 MegaPixels Digital Camera. 40M “photos”/s

Selection of 300 ‘photos’/s ~450 MB/s from the detector are ~PBs of data/year

• CERN has (of course) its well established data flow and infrastructure, however ...

IMPACT Innovative Methods for Particle Colliders at the Terascale(2012-2015)

IMPACT Innovative Methods for Particle Colliders at the Terascale(2012-2015, oversimplified vision)

huge data Diff(somehow)

Monte Carlo simulationof “known” traces

Filtered datato be analysed

(much smaller)

huge data

Particle physicists Theoretic physicists

new knowledge

Monte Carlo simulationssmaller but frequent

requires great “reactivity”of the workflow to tune models

new speculations

Preliminary results onchannel H→ZZ→4l

shown at Higg’s boson claim Jul 2012

(Amapane at al.)

Formalising the cell cycle switch

• •

• o o

o

! !

Direct competition: unstable switch

CWC syntax

Courtesy of Luca Cardelli

On Switches and Oscillators Program

Equivalence in Biology?

http://lucacardelli.name

... after a number of transformations: a stable switch faithfully modelling cell switch

CWC syntax

Courtesy of Luca Cardelli

On Switches and Oscillators Program

Equivalence in Biology?

http://lucacardelli.name

o

• o

o

• o

o

o

• Demo: unstable switch

50

100

150

200

250

300

350

400

450

500

550

600

0 5 10 15 20

K-m

eans

clu

ster

s on

mol

ecul

es (a

)

time

0

100

200

300

400

500

600

700

0 5 10 15 20

num

ber o

f "a"

mol

ecul

es

time

raw simulationsstandard deviation

mean

Schlögl modelautocatalytic, trimolecular reaction scheme (bistable)

Bacteriophage λ life cycleintegration of a strand of DNA in the molecule of E. coli DNA(multi-stable)

0

10

20

30

40

50

60

70

0 20 40 60 80 100 120

# M

olec

ules

(CI)

time

5

10

15

20

25

30

35

40

45

50

0 20 40 60 80 100 120

QT

clu

ste

rs w

ith t

ren

ds

(CI)

time

Transcriptional regulation in Neurospora(circadian clock period detection)

0.026

0.028

0.03

0.032

0.034

0.036

0.038

0.04

dark light dark light

Peak

freq

uenc

y

Light condition phases

0

200

400

600

800

1000

Ligh

t Con

ditio

n

0

200

400

600

800

1000

0 50 100 150 200

Dar

k C

ondi

tion

Time in hours

Conclusions• Talk focused on programming model

• Many important“in a cloud” ignored in the talk: middleware, faults, ...

• Data movement & high-level are key features• helps the mapping of features to platforms and performance portability• ease the design• MapReduce is an instance, should be not the only one

• Formal biology at embryonal stage• similar data & computation problems in analysis “in

vitro” experiments• increasing interest from industrial and “core” bio

scientist for parallel computing

• FastFlow• Massimo Torquati (Pisa), Marco Danelutto (Pisa), Peter Kilpatrick

(Belfast), Massimiliano Meneghin (IBM Research Dublin)

• Case studies, biological models• Luca Cardelli (Microsoft), Pietro Liò (Cambridge), Andrea

Bracciali (Stirling), Cristina Calcagno (Torino)

• CWC simulation implementation• Maurizio Drocco (Torino), Fabio Tordini (Torino)

• CWC language design• Maurizio Drocco, Eva Sciacca, Salvatore Spinella, Mario Coppo,

Angelo Troina, Ferruccio Damiani (Torino)

• Graphical User Interface• ETICA company

• Infiniband cluster and computing facilities• Mellanox, HPC Advisory Council

Acknowledgements

IMPACT

BioBITs