BioDCV: a grid-enabled complete validation setup for functional profiling Trieste, Feb 2006 http://mpa.itc.it Silvano Paoli, Davide Albanese, Giuseppe Jurman, Annalisa Barla, Stefano Merler, Roberto Flor, Stefano Cozzini, James Reid, Cesare Furlanello


Predictive Profiling

QUESTIONS for a discriminating molecular signature:
• predict disease state
• identify patterns regarding subclasses of patients

[Figure: gene expression array (Affy), genes by samples, with samples split into Group A and Group B; marked regions: over-expression in group A, over-expression in group B, under-expression in group B]

A PANEL OF DISCRIMINATING GENES?

Roadmap for a new grid application

Starting from a suite of C modules and Perl/shell scripts running on a local HPC resource …

1. Optimize modules and scripts:
   • database management of data, of model structures, of system outputs, scripts for OpenMosix Linux Clusters
2. Wrap BioDCV into a grid application
   • Learn about grid computing
   • Port the serial version on a computational grid testbed
   • Analyze/verify results: identify needs/problems
3. Wrap with C MPI scripts
   • Build the MPI mechanism
   • Experiment on the testbed
   • Submit on production grid
   • Test scalability
4. Production
5. Egee Biomed/NA4

Timeline:
• Sept-Dec 2004
• Nov 04 - January 2005
• February 2005
• March 05: Up and Running!
• Sept 05: 1500 jobs, 500+ computing days on production grid
• Feb 2006: Egee

1. Optimize modules and scripts

Rewrite shell/Perl scripts in C language
• control I/O costs
• a process granularity optimal for temporary data allocation without tmp files
• convenient for migrations

SQLite interface (Database engine library)
• SQLite is small, self-contained, embeddable
• It provides a relational access to model and data structures (inputs, outputs, diagnostics)
• It supports transactions and multiple connections, databases up to 2 terabytes in size

local copy (db file): model definitions + a copy of data + indexes defining the partition of the replicate sample(s)
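The "local copy (db file)" idea can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The slides do not show the BioDCV schema, so the table names (`model`, `expression`, `split_index`), their columns, and the sample values below are all hypothetical; the sketch only shows how model definitions, a copy of the data, and partition indexes can live in one SQLite file and be read back relationally.

```python
import sqlite3

def create_local_db(path=":memory:"):
    """Create a small database with model definitions, data, and split indexes.

    The schema is hypothetical; it is not the BioDCV schema.
    """
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT, params TEXT);
        CREATE TABLE expression (sample_id INTEGER, gene_id INTEGER, value REAL,
                                 PRIMARY KEY (sample_id, gene_id));
        CREATE TABLE split_index (replicate INTEGER, sample_id INTEGER, role TEXT,
                                  PRIMARY KEY (replicate, sample_id));
    """)
    return conn

conn = create_local_db()
with conn:  # the inserts below are committed together as one transaction
    conn.execute("INSERT INTO model (name, params) VALUES (?, ?)", ("svm", "C=1.0"))
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)",
                     [(0, 0, 1.5), (0, 1, -0.2), (1, 0, 0.7), (1, 1, 2.1)])
    conn.executemany("INSERT INTO split_index VALUES (?, ?, ?)",
                     [(0, 0, "train"), (0, 1, "test")])

# Relational access: expression values of the training samples of replicate 0.
train_rows = conn.execute(
    "SELECT e.sample_id, e.gene_id, e.value FROM expression e "
    "JOIN split_index s ON s.sample_id = e.sample_id "
    "WHERE s.replicate = ? AND s.role = 'train' ORDER BY e.gene_id", (0,)
).fetchall()
```

Keeping everything in a single file is what makes the copy easy to ship to a worker and back; the in-memory path used here is only for illustration.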

BioDCV

(1) exp: experiment design through configuration of the setup database

(2) scheduler: script submitting jobs (run) on each available processor. Platform dependent.

(3) run: performs fractions of the complete validation procedures on several data splits. A local db is created.

(4) unify: the local datasets are merged with setup after completing the validation tasks. A complete dataset collecting all the relevant parameters is created.
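The unify step, merging several local databases into one, could look roughly like the following sketch. It relies on SQLite's `ATTACH DATABASE` statement through Python's `sqlite3` module; the `result` table, its columns, and the file names are hypothetical and stand in for whatever the real local databases contain.

```python
import os
import sqlite3
import tempfile

SCHEMA = "CREATE TABLE IF NOT EXISTS result (replicate INTEGER, n_genes INTEGER, error REAL)"

def make_local_db(path, rows):
    """Write a hypothetical local result database, as one 'run' might."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO result VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

def unify(setup_path, local_paths):
    """Append the rows of every local database to the setup database."""
    conn = sqlite3.connect(setup_path)
    conn.execute(SCHEMA)
    conn.commit()
    for path in local_paths:
        conn.execute("ATTACH DATABASE ? AS local_db", (path,))
        conn.execute("INSERT INTO result SELECT replicate, n_genes, error FROM local_db.result")
        conn.commit()  # DETACH is not allowed inside an open transaction
        conn.execute("DETACH DATABASE local_db")
    total = conn.execute("SELECT COUNT(*) FROM result").fetchone()[0]
    conn.close()
    return total

with tempfile.TemporaryDirectory() as tmp:
    local_a = os.path.join(tmp, "run_a.db")
    local_b = os.path.join(tmp, "run_b.db")
    make_local_db(local_a, [(0, 100, 0.21), (1, 100, 0.25)])
    make_local_db(local_b, [(2, 100, 0.19)])
    merged_rows = unify(os.path.join(tmp, "setup.db"), [local_a, local_b])
```

Because each run writes only to its own file, the runs need no coordination until this final merge.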

2. Wrapping into a grid application

Why port to the grid?
• Because we do not have “enough” computational resources…

How to port BioDCV to the grid?

PRELIMINARY
• Identify a collaborator with experience in grid computing (e.g. the Egrid Project hosted at ICTP, http://www.egrid.it)
• Train human resources (SP Trieste)
• Join the Egrid testbed (installing a grid site in Trento)

HANDS-ON
• Porting of the serial application on the testbed
• Patch code as needed: code portability is mandatory to make life easier
• Identify requirements/problems

A few EDG/LCG definitions

Storage Element (SE): stores the user data in the grid and makes it available for subsequent elaboration

Computing Element (CE): where the grid user programs are delivered for elaboration: this is usually a front-end to several elementary Worker Node machines

Worker Node (WN): machines where the user programs are actually executed, possibly with multiple CPUs

User Interface (UI): machine to access the GRID

[Figure: a grid site with CE, SE (m TByte), and WNs (N CPUs)]

The ICTP Egrid project infrastructures

The local testbed in Trieste (now gridats)
• Small computational grid based on EDG middleware + Egrid add-ons
• Designed for testing/training/porting of applications
• Full compatibility with Grid.it middleware

The production infrastructure:
• A Virtual Organization within Grid.it, with its own services
• Star topology with central node in Padova (only in version 1)

[Figure: star topology centered on Padova (CE, SE 2.8 TByte, WNs 100 cpus), with CE+SE+WN sites labelled Trento, Roma, Trieste, Firenze, Palermo]

Hands on

Porting the serial application
• Easy task due to portability (no actual work needed)
• No software/library dependencies

Testing/Evaluation
• Problems identified:
   • Job submission overhead due to EDG mechanisms
   • Managing multiple (~hundreds/thousands) jobs is difficult and cumbersome
• Answer: parallelize jobs on the GRID via MPI
   • Single submission
   • Multiple executions

3. Wrap with C MPI scripts

How can we use C MPI?

Prepare two wrappers and a unifier:
• one shell script to submit jobs (BioDCV.sh)
• one C MPI program (Mpba-mpi)
• one shell script to integrate results (BioDCV-union.sh)

BioDCV.sh in action:
• copies files from and to the Storage Element (SE) and distributes the microarray dataset to all WNs.
• It then starts the C MPI wrapper, which spawns several runs of the BioDCV program (optimized for resources)
• When all BioDCV runs are completed, the wrapper copies all the results (SQLite files) from the WNs to the starting SE.

Mpba-mpi executes the BioDCV runs in parallel
BioDCV-union.sh collates results in one SQLite file (R)
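The slides do not show how Mpba-mpi assigns BioDCV runs to processes. One common way a wrapper of this kind can divide a fixed number of independent runs is a block split in which each process gets a contiguous range and the sizes differ by at most one. The sketch below only illustrates that arithmetic in plain Python (no MPI involved); `split_runs` and the values 400 and 64 are illustrative assumptions, not the actual Mpba-mpi logic.

```python
def split_runs(n_runs, n_procs):
    """Return one list of run indexes per process; sizes differ by at most one."""
    base, extra = divmod(n_runs, n_procs)
    chunks = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional run each.
        size = base + (1 if rank < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks

# Illustrative values only: 400 runs shared among 64 processes.
chunks = split_runs(400, 64)
```

In an MPI program each process would compute only its own range from its rank, so no communication is needed to agree on the assignment.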

Using BioDCV in Egrid

[Figure: from the UI (Egrid Live CD*), the job is submitted with “edg-job-submit BioDCV.jdl” to the resource broker (PD-TN), which reaches the grid sites: Padova (CE, SE 2.8 TByte, WNs 100 cpus) and CE+SE+WN sites including Trieste, Palermo, Trento, ...]

* a bootable Linux live-cd distribution with a complete suite of GRID tools by Egrid (ICTP Trieste)

A Job Description

BioDCV.jdl:

    [
      Type = "Job";
      JobType = "MPICH";
      NodeNumber = 64;
      Executable = "BioDCV.sh";
      Arguments = "Mpba-mpi 64 lfn:/utenti/spaoli/sarcoma.db 400";
      StdOutput = "test.out";
      StdError = "test.err";
      InputSandbox = {"BioDCV.sh","Mpba-mpi","run","run.sh"};
      OutputSandbox = {"test.err","test.out","executable.out"};
      Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF";
    ]
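In the job description, the node count appears twice: as `NodeNumber` and inside `Arguments`. A small generator can keep the two in step when the number of CPUs is varied for scaling tests. The sketch below is an assumption about how one might do that, not part of BioDCV; `build_jdl` is a hypothetical helper, and the meaning of the last argument (400 on the slide) is not stated in the slides, so it is passed through unchanged as `last_arg`.

```python
def build_jdl(node_number, db_lfn, last_arg, lrms_types=("PBS", "LSF")):
    """Render a JDL text in which NodeNumber and Arguments always agree."""
    requirements = " || ".join(
        f'other.GlueCEInfoLRMSType == "{name}"' for name in lrms_types
    )
    lines = [
        "[",
        '  Type = "Job";',
        '  JobType = "MPICH";',
        f"  NodeNumber = {node_number};",
        '  Executable = "BioDCV.sh";',
        f'  Arguments = "Mpba-mpi {node_number} {db_lfn} {last_arg}";',
        '  StdOutput = "test.out";',
        '  StdError = "test.err";',
        '  InputSandbox = {"BioDCV.sh","Mpba-mpi","run","run.sh"};',
        '  OutputSandbox = {"test.err","test.out","executable.out"};',
        f"  Requirements = {requirements};",
        "]",
    ]
    return "\n".join(lines)

# Reproduces the values shown on the slide.
jdl_text = build_jdl(64, "lfn:/utenti/spaoli/sarcoma.db", 400)
```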

Using BioDCV in Egrid (II)

First step: BioDCV.sh runs on WN 1. BioDCV.sh copies data from SE to the WN.
[Figure: WN 1 sends a file request to the SE and receives sarcoma.db]

Second step: Mpba-mpi and sarcoma.db are distributed to all the involved WNs.
[Figure: BioDCV and sarcoma.db on WN 1, WN 2, WN 3, ..., WN n]

Using BioDCV in Egrid (III)

Third step: BioDCV is executed on all involved WNs by MPI.
[Figure: Mpba-mpi runs BioDCV on WN 1, WN 2, WN 3, ..., WN n; each WN ends with "Job completed"]

Fourth step: BioDCV.sh runs on WN 1. BioDCV.sh copies all results (SQLite files) from the WNs to the starting SE.
[Figure: the outputs of the WNs are collected and sent to the SE]

RUNNING ON THE TESTBED (EGRID.IT), MARCH 2005

SCALING UP TESTS

[Figure a: INT-IFOM Sarcoma dataset (7143 genes, 35 samples); time (seconds) vs. cpu no. (1, 2, 4, 8, 16, 32, 64); series: Computing, File Copying, Total Time]

CPU no.   Computing (sec)   Copying files (sec)   Total time (sec)
1         65883             25                    66018
2         22566             46                    22749
32        507.14            617                   1454
64        554.37            1263                  2389

CPUs: Intel Xeon @ 2.80 GHz

[Figure b: Colon cancer dataset (2000 genes, 62 samples); cpu no. 1, 2, 4, 8, 16; vertical axis from 0 to 900]
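The timing table can be turned into the usual scaling figures with a few lines of code. The sketch below takes the four rows as reported (reading "507,14" and "554,37" as decimal commas, which is an assumption) and computes speed-up relative to the 1-CPU total time, parallel efficiency (speed-up divided by CPU count), and the share of total time spent copying files. The function name `summarize` is hypothetical.

```python
# (cpus, computing_s, copying_s, total_s), as reported in the table
rows = [
    (1, 65883.0, 25.0, 66018.0),
    (2, 22566.0, 46.0, 22749.0),
    (32, 507.14, 617.0, 1454.0),
    (64, 554.37, 1263.0, 2389.0),
]

def summarize(rows):
    """Compute speed-up, efficiency, and copy fraction for each row."""
    t1 = rows[0][3]  # total time on one CPU is the baseline
    summary = []
    for cpus, _computing, copying, total in rows:
        speedup = t1 / total
        summary.append({
            "cpus": cpus,
            "speedup": speedup,
            "efficiency": speedup / cpus,
            "copy_fraction": copying / total,
        })
    return summary

summary = summarize(rows)
for entry in summary:
    print(entry)
```

On these numbers the total time at 64 CPUs is higher than at 32, and file copying accounts for more than half of the total at 64 CPUs.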

Results of Phase 1

The pros:
• MPI execution on the GRID in a few days.
• The tests showed scalable behavior of our grid application for increasing numbers of CPUs
• Grid computing significantly reduces production times and allows us to tackle larger problems (see next slide)

The cons:
• Data movements limit the scalability for a large number of CPUs
• Note: this is a GRID.it limitation: there is no shared filesystem between the WNs, so each file needs to be copied everywhere!

To hide the latency (ideas):
• Smart data distribution from MWN to WNs:
   • Reduce the amount of data to be moved
   • Proportionate BioDCV subtasks to local cache
• Data transferred via MPI communication (requires MPI coding and some MPI adaptation of the code)

Phase 2: Improving the system

Reduce the amount of data to be moved

1. Redesign “per run”:
   • SVM models (about 200) and results, evaluation
   • Variables for semisupervised analysis
   all managed within one data structure
2. A large part of the sampletracking semisupervised analysis is now managed within BioDCV (about 2000 files, 300 MB), i.e. stored through SQLite.
3. Randomization of labels is fully automated
4. The SVM library is now an external library:
   • Modular use of machine learning methods
   • Now being added: a PDA module
5. BioDCV now under GPL (code curation …)
6. Distributed at BioDCV with a SubVersion server since September 2005

PDA: Penalized Discriminant Analysis

CLUSTER AND GRID ISSUES

1. At work on several clusters:
   • MPBA-old: 50 P3 CPUs, 1 GHz
   • MPBA-new: 6 Xeon CPUs, 2.8 GHz
   • ECT* (BEN): up to 32 (of 100) Xeon CPUs, 2.8 GHz
   • ICTP cluster: up to 8 (of 60) P4, 2 GHz, Myrinet
2. GRID experiences
   A. Egrid “production grid” (INFN Padua): up to 64 (of 100) Xeon CPUs, 2-3 GHz
      Microarray data: Sarcoma, HS random, Morishita, Wang, …
   B. LESSONS LEARNED:
      i. the latest version reduces latencies (system times) due to file copying and management; CPU saturation
      ii. Life quality (and more studies): huge reduction of file installing and retrieving from facilities and WITHIN facilities
      iii. Forgetting the severe limitation of file system (AFS, …)
      iv. Now installing 2.6 LCG2 (CERN release September 2005)

NOVEMBER/DECEMBER 2005: TEST CLUSTER AND GRID

In this section we present two experiments designed to measure the performance of the BioDCV parallel application in two different available computing environments: a standard Linux cluster and a computational grid.

SPEED-UP
In Benchmark 1, we study the scalability of our application as a function of the number of CPUs (from 1 to 64).

FOOTPRINT
In Benchmark 2, we characterize the BioDCV application with respect to different datasets, i.e. for different numbers of features (d) and numbers of samples (N) for the complete validation experiment (for a fixed number of 32 CPUs).

Tasks for BioDCV (E-RFE SVM) - DECEMBER 2005

Liver cancer: 213 samples (107 tumors from liver cancer + 106 non tumoral/normal), 1993 genes, ATAC-PCR (Sese et al., 2000)

Breast cancer:
• Wang et al. 2005: 238 samples (286 lymph-node-negative), Affymetrix, 17819 genes
• Chang et al. 2005: 295 samples (151 lymph-node-negative, 144 pos), cDNA, 25000 genes
• IFOM: 62 BRCA (4 subclasses)

Pediatric Leukemia: 327 samples, 12626 genes (7 classes, binary: 284+43), Yeoh et al. 2002

Sarcoma: 37 samples, 7143 genes (4 classes)

[Slide labels marking the datasets: Benchmark 1, Benchmark 2]

DECEMBER 2005 - SPEED-UP (Cluster and GRID)

[Figure: two panels, "LiverCanc: cluster" and "PedLeuk: cluster"; speed-up (0 to 8) vs. N. Cpu (1, 2, 4, 8); series: Experimental data, Linear Speed]

(DECEMBER 2005) FOOTPRINT GRID

[Figure: time (s), from 1500 to 100000, vs. dN x 10e-7 (1 to 8); datasets: BRCA, Chang, Morishita, PL, Sarcoma, Wang; series: T_tns, E_g, 10 x L_i, 10 x U, 10 x S]

FOOTPRINT dN: #genes x #samples
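As a worked example of the footprint definition, the sketch below computes dN = #genes x #samples for the datasets whose sizes are listed on the tasks slide and ranks them by footprint. The dictionary keys are shortened labels chosen here for illustration; the IFOM BRCA set is left out because its gene count is not given in the slides.

```python
# (number of genes d, number of samples N), as listed on the tasks slide
datasets = {
    "Liver cancer": (1993, 213),
    "Wang": (17819, 238),
    "Chang": (25000, 295),
    "Pediatric Leukemia": (12626, 327),
    "Sarcoma": (7143, 37),
}

def footprint(n_genes, n_samples):
    """Dataset footprint dN: number of genes times number of samples."""
    return n_genes * n_samples

# Datasets ordered from smallest to largest footprint.
ranked = sorted(
    ((name, footprint(d, n)) for name, (d, n) in datasets.items()),
    key=lambda item: item[1],
)
for name, value in ranked:
    print(f"{name}: dN = {value}")
```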

NOVEMBER/DECEMBER 2005: DISCUSSION

• Two experiments for 139 CPU days in the Egrid infrastructure
• In Benchmark 1, we obtain a speed-up curve very close to linear
• In Benchmark 2, effective execution time increases linearly with the dataset footprint, i.e. the product of number of genes and number of samples
• A performance penalty is paid for data transfer from WNs and for the queuing policy (best effort), with respect to a local Linux cluster
• We will investigate substituting the MPI approach with a model that submits N different single-CPU jobs
• The BioDCV system on the LCG/EGEE computational grid can be used in practical large-scale experiments
• Next is porting our system under EGEE’s Biomed VO.

Challenges (AIRC-BICG)

•BASIC CLASSIFICATION: MODELS, lists, additional tools

•Tools for researchers: subtype discovery, outlier detection

•Connection to data (DB–MIAME)

HPC-Interaction: access through webportal to GRID HPC

[Figure: web portal (www, Apache)]

Acknowledgments

ITC-irst, Trento: Alessandro Soraruf

ICTP E-GRID Project, Trieste: Angelo Leto, Cristian Zoicas, Riccardo Murri, Ezio Corso, Alessio Terpin

IFOM-FIRC and INT, Milano: Manuela Gariboldi, Marco A. Pierotti

Grants: BICG (AIRC), Democritos

Data: IFOM-FIRC, Cardiogenomics PGA