Upload
bartholomew-jacobs
View
217
Download
0
Tags:
Embed Size (px)
Citation preview
BioDCV: a grid-enabled complete validation setup
for functional profiling
BioDCV: a grid-enabled complete validation setup
for functional profiling
Wannsee Retreat, October 2005
http://mpa.itc.ithttp://mpa.itc.it
Cesare Furlanello with Silvano Paoli, Davide Albanese. Giuseppe Jurman, Annalisa Barla, Stefano Merler, Roberto Flor
Cesare Furlanello with Silvano Paoli, Davide Albanese. Giuseppe Jurman, Annalisa Barla, Stefano Merler, Roberto Flor
Algorithms and software systems for
1. Predictive classification, feature selection, discovery
Algorithms and software systems for
1. Predictive classification, feature selection, discovery
Our BioDCV system: a set-up based on the E-RFE algorithm for Support Vector Machines (SVM)
•Control of selection bias, a serious experimental design issue in the use of prognostic molecular signatures
•Subtype identification for studies of disease evolution and response to treatment
Our BioDCV system: a set-up based on the E-RFE algorithm for Support Vector Machines (SVM)
•Control of selection bias, a serious experimental design issue in the use of prognostic molecular signatures
•Subtype identification for studies of disease evolution and response to treatment
Predictive classification and functional profilingPredictive classification and functional profiling
“In conclusion, the list of genes included in a molecular signature (based on one training set and the proportion of misclassifications seen in one validation set) depends greatly on the selection of the patients in training sets.”
“Five of the seven largest published studies addressing cancer prognosis did not classify patients better than chance. This result suggests that these publications were overoptimistic.”
John P A IoannidisFebruary 5, 2005
Selection biasSelection bias
“the 95% CI for the proportion of misclassifications fell to below50% for some training-set sizes in only two of the studies”
“We noted unstable molecular signatures and misclassification rates (with minimum rates between 31% and 49%).”
Michiels et al, Lancet 2005
The authors present a novel algorithm for classification, preprocessing , feature selection, … A description is available in XXX and the algorithm is publicly available as a Windows program/ website/ R package YYY.
BUT, HAVE THEY ANSWERED THE FOLLOWING QUESTIONS?
1. Which classification result could be achieved using standard algorithms and is there a difference in classification quality between a standard algorithm and the proposed one?
2. If there is a substantial difference, what is the reason?
The authors present a novel algorithm for classification, preprocessing , feature selection, … A description is available in XXX and the algorithm is publicly available as a Windows program/ website/ R package YYY.
BUT, HAVE THEY ANSWERED THE FOLLOWING QUESTIONS?
1. Which classification result could be achieved using standard algorithms and is there a difference in classification quality between a standard algorithm and the proposed one?
2. If there is a substantial difference, what is the reason?
M. Ruschhaupt et al (2004) "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology: 3 (1), Article 37.
A NEW PAPER ON PREDICTIVE GENE PROFILING … A NEW PAPER ON PREDICTIVE GENE PROFILING …
INGREDIENTS: DATA + METHODS (- EXPERIMENTAL SETUP?)INGREDIENTS: DATA + METHODS (- EXPERIMENTAL SETUP?)
A NEW PAPER ON PREDICTIVE GENE PROFILING … A NEW PAPER ON PREDICTIVE GENE PROFILING …
INGREDIENTS: DATA + METHODS (- EXPERIMENTAL SETUP?)INGREDIENTS: DATA + METHODS (- EXPERIMENTAL SETUP?)
M. Ruschhaupt et al (2004) "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology: 3 (1), Article 37.
METAGENE RF-M PAM-M
PLR-M SVM-M
BBT-M Size
No Recurrence
3 2 3 2 1 34
Recurrence
10 9 12 12 9 18
All 13 11 15 14 10 52
NO METAG.
RF PAM PLR SVM BBT Size
No Recurrence
2 3 2 2 2 34
Recurrence
12 8 12 12 10 18
All 14 11 14 14 12 52
REANALYSIS OF DATASET: Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, West M, Nevins JR, Huang AT. Gene expression predictors of breast cancer outcomes. The Lancet 361:1590–1596 (2003).
ERRORS FOR DIFFERENT CLASSIFIERS AND WITH/WITHOUT METAGENES IN COMPLETE VALIDATION RF: random forest; PAM: Class prediction by nearest shrunken centroids; PLR: Penalized logistic regression; SVM: Support Vector Machines; BBT: Bayesian Binary Prediction Tree Models; Metagenes: new variables from linear combinations
Misclassification rates of around 25% with all eight methods.
The use of metagenes did not seem to make a big difference either way.
Most of the misclassified samples come from the group of patients with recurrence, which is the smaller group. Possibly, this could be explained by a preference of the classification algorithms to favour the larger group.
Misclassification rates of around 25% with all eight methods.
The use of metagenes did not seem to make a big difference either way.
Most of the misclassified samples come from the group of patients with recurrence, which is the smaller group. Possibly, this could be explained by a preference of the classification algorithms to favour the larger group.
M. Ruschhaupt et al (2004) "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology: 3 (1), Article 37.
RESULTSRESULTSRESULTSRESULTS
THEN: HOW TO EVALUATE ACCURACY IN PREDICTION?THEN: HOW TO EVALUATE ACCURACY IN PREDICTION?THEN: HOW TO EVALUATE ACCURACY IN PREDICTION?THEN: HOW TO EVALUATE ACCURACY IN PREDICTION?
To avoid selection bias (p>>n): a COMPLETE VALIDATION SCHEME*
• externally a stratified random partitioning,• internally a model selection based on a K-fold cross-validation
3 x 105 SVM models (+ random labels 2 x 106) **
** Binary classification, on a 20 000 genes x 45 cDNA array, 400 loops
* Ambroise & McLachlan, 2002, Simon et. al 2003, Furlanello et. al 2003
OFS-M: Model tuning and Feature ranking
ONF: Optimal gene panel estimator
ATE: Average Test Error
The BioDCV setup (E-RFE SVM)The BioDCV setup (E-RFE SVM)
Tasks for BioDCV (E-RFE SVM)Tasks for BioDCV (E-RFE SVM)
Lymphoma: 96 samples (74 tumoral + 24 control) described by 4096 genes (Alizadeh et. al, 2000)
Tumor vs. Metastases: 76 samples (64 primary tumoral + 12 metastatic) described by 16063 genes (Ramaswamy et al., 2001)
High Sezary CTCL: 30 samples (18 disease + 12 control) described by 6660 genes (Kari et al., 2003) – coll. Wistar Inst.
Glioma: 50 samples (28 glioblastoma + 22 oligodendroglioma) described by 12 627 genes (Nutt et al., 2003)
Breast cancer: 37 samples (18 high risk + 19 low risk) described by 12 625 genes (Huang et al., 2003)
Mouse Model of Myocardial Infarction: 36 samples (18 infarcted + 18 control) described by 12488 geni (Cardiogenomics PGA http://cardiogenomics.med.harvard.edu, 2003)
Colon cancer: 62 samples (40 tumoral + 22 control) described by 2000 genes (Alon et. al, 1999)
Tasks for BioDCV (E-RFE SVM)Tasks for BioDCV (E-RFE SVM)
Liver cancer: 213 samples, 107 tumors from liver cancer +106 non tumoral/normal, 1993 genes, ATAC-PCR (Sese et. al, 2000)
Breast cancer:
Wang et al. 2005: 238 samples (286 lymph-node-negative), Affimetrix, 17819 genes
Chang et al. 2005: 295 samples (151 lymph-node-negative, 144 pos), cDNA 25000 genes
IFOM: 62 BRCA (4 subclasses)
Pediatric Leukemia: 327 samples, 12626 genes (7 classes, binary: 284+43), Yeoh et al. 2002
Tumor vs. Metastases: 76 samples (64 primary tumoral + 12 metastatic) described by 16063 genes (Ramaswamy et al., 2001)
High Sezary CTCL: 30 samples (18 disease + 12 control) described by 6660 genes (Kari et al., 2003) – coll. Wistar Inst.
Glioma: 50 samples (28 glioblastoma + 22 oligodendroglioma) described by 12 627 genes (Nutt et al., 2003)
Breast cancer: 37 samples (18 high risk + 19 low risk) described by 12 625 genes (Huang et al., 2003)
Mouse Model of Myocardial Infarction: 36 samples (18 infarcted + 18 control) described by 12488 geni (Cardiogenomics PGA http://cardiogenomics.med.harvard.edu, 2003)
Dataset Dimension Job no Svm Tot (h) Cluster (h)
Breast Cancer
37 x 12625
200 158 961 1177 36
MM NILV –LV 41 x 12488
400 333 163 2234 59
Glioma 50 x 12627
200 146 262 1864 57
High sezarySurvival
49 x 2244 200 126 150 63 1,4
High sezaryWithout C0032
29 X 6600 400 309 681 655 23
Our HPC resource, MpaCluster, 6Xeon+ 40 Pentium CPU, OpenMosix, 3 TeraB central storage. Upgraded in 2005. production in GRID
Our HPC resource, MpaCluster, 6Xeon+ 40 Pentium CPU, OpenMosix, 3 TeraB central storage. Upgraded in 2005. production in GRID
2003-2004, P3 biproc.
1. With a Linux OpenMosix HPC facility1. With a Linux OpenMosix HPC facility
Starting from a suite of C modules and Perl/shell scripts running Starting from a suite of C modules and Perl/shell scripts running on a local HPC resource …on a local HPC resource …
1.1. Optimize modules and scriptsOptimize modules and scripts:* :* database management of data, of model structures, database management of data, of model structures,
of system outputs, scripts for OpenMosix Linux Clustersof system outputs, scripts for OpenMosix Linux Clusters
2.2. Wrap BioDCV into a grid applicationWrap BioDCV into a grid application Learn about grid computing Learn about grid computing Port the serial versionPort the serial version on a computationalon a computational grid testbed grid testbed Analyze/verify results: identify needs/problems Analyze/verify results: identify needs/problems
3.3. Wrap with C MPI scriptsWrap with C MPI scripts Build the MPI mechanism Build the MPI mechanism Experiment on the testbed Experiment on the testbed Submit Submit on productionon production grid grid Test scalabilityTest scalability
4.4. ProductionProduction March 05:Up and Running!
February 2005
November-04 January 2005
Sept-Dec 04
Sept 05: 1500 jobs,500+ computing days on production grid
Roadmap for a new grid applicationRoadmap for a new grid application
Rewrite shell/Perl scripts in C languageRewrite shell/Perl scripts in C language control I/O costs, control I/O costs, a process granularity optimal for a process granularity optimal for
temporary data allocation without tmp filestemporary data allocation without tmp files convenient for migrationsconvenient for migrations
SQLite interface (Database engine library)SQLite interface (Database engine library) SQLite is small, self-contained, embeddableSQLite is small, self-contained, embeddable It provides a relational access to model and data It provides a relational access to model and data
structures (inputs, outputs, diagnostics)structures (inputs, outputs, diagnostics) It supports transactions and multiple connections, It supports transactions and multiple connections,
databases up to 2 terabytes in sizedatabases up to 2 terabytes in size
local copy (db file):
+ model definitions+ a copy of of data + indexes defining the partition
of the replicate sample(s)
1. Optimize modules and scripts1. Optimize modules and scripts
BioDCV
(1)exp : experiment design through configuration of the setup database
(2)scheduler : script submitting jobs (run) on each available processor. Platform dependent.
(3)run : performs fractions of the complete validation procedures on several data splits. Local db is created
(4)unify : the local datasets are merged with setup after completing the validations tasks. A complete dataset collecting all the relevant parameters is created.
Why porting into the grid?Why porting into the grid? Because we need “enough” Because we need “enough”
computational resources… computational resources…
How to port the BioDCV in grid?How to port the BioDCV in grid? PRELIMINARYPRELIMINARY
Identify a collaborator with experience in grid computing Identify a collaborator with experience in grid computing (e.g.(e.g. the Egrid Project hosted at ICTP the Egrid Project hosted at ICTP http://www.egrid.it http://www.egrid.it ))
Train human resources (SP Train human resources (SP Trieste) Trieste) Join the Egrid testbed (installing a supernode in Trento)Join the Egrid testbed (installing a supernode in Trento)
HANDS-ONHANDS-ON PortingPorting of the serial application on the testbed of the serial application on the testbed patch code as needed: code portability is mandatory to patch code as needed: code portability is mandatory to
make life easiermake life easier Identify requirements/problems Identify requirements/problems
2. Wrapping into a grid application2. Wrapping into a grid application
A few EDG definitions A few EDG definitions
Storage Element (SE): stores the user data in the grid and makes it available for subsequent elaboration
Computing Element (CE): where the grid user programs are delivered for elaboration: this is usually a front-end to several elementary Worker Node machines
Worker Node (WN): machines where the user programs are actually executed, possibly with multiple CPUs
User Interface (UI): machine to access the GRID CE SE
m TByteWNs
N CPUs
site
The local testbed in TriesteThe local testbed in Trieste Small computational grid based on EDG middleware + Egrid add-onsSmall computational grid based on EDG middleware + Egrid add-ons Designed for testing/training/porting of applicationsDesigned for testing/training/porting of applications Full compatibility with Grid.it middlewareFull compatibility with Grid.it middleware
The production infrastructure:The production infrastructure: A Virtual Organization within Grid.it, with its own services A Virtual Organization within Grid.it, with its own services Star topology with central node in Padova Star topology with central node in Padova
CE SE2.8 TByte
WNs 100 cpus
PadovaCE+SE+WN
TrentoCE+SE+WNRoma
CE+SE+WN
Trieste
CE+SE+WNFirenze
CE+SE+WNPalermo
The ICTP Egrid project infrastructuresThe ICTP Egrid project infrastructures
Porting the serial application Porting the serial application Easy task due to portability (no actual work needed)Easy task due to portability (no actual work needed) No software/library dependencies No software/library dependencies
Testing/Evaluation Testing/Evaluation Problems identified: Problems identified:
Job submission overhead due to EDG mechanisms Job submission overhead due to EDG mechanisms Managing multiple (~hundreds/thousands) jobs is Managing multiple (~hundreds/thousands) jobs is
difficult and cumbersomedifficult and cumbersome Answer: parallellize jobs on the GRID via MPIAnswer: parallellize jobs on the GRID via MPI
Single submission Single submission Multiple executions Multiple executions
Hands on Hands on
How can we use C MPI?How can we use C MPI?
Prepare two wrappers, and an unifierPrepare two wrappers, and an unifier one shell script to submit jobs (one shell script to submit jobs (BioDCV.sh)BioDCV.sh) one C MPI program (one C MPI program (Mpba-mpi)Mpba-mpi) one shell script to integrate results one shell script to integrate results ((BioDCV-union.shBioDCV-union.sh))
BioDCV.sh in action:BioDCV.sh in action: copies file from and to Storage Element (SE) and distributes copies file from and to Storage Element (SE) and distributes
the microarray dataset to all WNs. the microarray dataset to all WNs. It then starts the C MPI wrapper which spawns several runs It then starts the C MPI wrapper which spawns several runs
of the BioDCV program (optimize for resources) of the BioDCV program (optimize for resources) When all BioDCV runs are completed, the wrapper copies When all BioDCV runs are completed, the wrapper copies
all the results (SQLite files) from the WNs to the starting SE.all the results (SQLite files) from the WNs to the starting SE. MPBA-MPI executes the BioDCV runs in parallelMPBA-MPI executes the BioDCV runs in parallel BioDCV-union.sh collates results in one SQLite file (BioDCV-union.sh collates results in one SQLite file ( R) R)
C M
PI
C M
PI
3. Wrap with C MPI scripts 3. Wrap with C MPI scripts
Using BioDCV in EgridUsing BioDCV in Egrid
UI Egrid Live CD*
Resource broker (PD-TN)
CE SE2.8 TByte
WNs 100 cpus
CE+SE+WNCE+SE+WN
CE+SE+WN
Padova
TriestePalermo
Trento
. . . .
“Edg-job-submit bioDCV.jdl”
site
a bootable Linux live-cd distribution with a complete suit of GRID tools by Egrid (ICTP Trieste)
[ Type = "Job"; JobType = "MPICH"; NodeNumber = 64; Executable = “BioDCV.sh"; Arguments = “Mpba-mpi 64 lfn:/utenti/spaoli/sarcoma.db 400"; StdOutput = "test.out"; StdError = "test.err"; InputSandbox = {“BioDCV.sh",“Mpba-mpi","run", "run.sh"}; OutputSandbox = {"test.err","test.out","executable.out"}; Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF";]
BioDCV.jdl
A Job DescriptionA Job Description
Second step:
. . . WN 1
WN 2
WN 3
WN n
Mpba-mpi and Sarcoma.db are distributed to all the
involved WNsBioDCVSarcoma.db
WN 1
BioDCV.sh runs on
Request file
sarcoma.db
SE
First step:
BioDCV.sh copies data from
SE to the WN
Using BioDCV in Egrid (II)Using BioDCV in Egrid (II)
. . .
SE
Fourth step:
Output
Output
Output
Output
WN 1
BioDCV.sh copies all results (SQLite files) from the WNs to the starting SE
BioDCV.sh runs on
. . .
Third step:
WN 1
BioDCV is executed on all involved WNs
by MPI
Mpba-mpi runs onWN2BioDCV
runs onMpba-mpi on WN3
Mpba-mpi on WN n
Job completed
Job completed
Job completed
Using BioDCV in Egrid (III)Using BioDCV in Egrid (III)
RUNNING ON THE TESTBED (EGRID.IT)
INT-IFOM Sarcoma
0
10000
20000
30000
40000
50000
60000
70000
1 2 4 8 16 32 64
cpu no.
Tim
e (
se
co
nd
s)
Computing File Copying Total Time
CPU no. Computing
(sec)
Copying files
(sec)
Total time
(secondi)
1 65883 25 66018
2 22566 46 22749
32 507,14 617 1454
64 554,37 1263 2389
CPUs: Intel Xeon @ 2.80 GHz
SCALING UP TESTS
Colon cancer
0
100
200
300
400
500
600
700
800
900
1 2 4 8 16
a.
b.INT-IFOM Sarcoma dataset
7143 genes35 samples
a.
Colon cancer dataset2000 genes62 samples
b.
BioDCV Outlier detection
Compare subgroups and pathological features
213
Sam
ple
s198
Sam
ple
s
Complete dataset
Shaved datasetComplete dataset
213
Sam
ple
s
BioDCV
Usage (examples)Usage (examples)
Example of Semisupervised analysis (Sese)Example of Semisupervised analysis (Sese)
The pros:The pros:
MPI execution on the GRID in a few days.. MPI execution on the GRID in a few days.. The tests showed scalable behavior of our grid application for The tests showed scalable behavior of our grid application for
increasing numbers of CPUsincreasing numbers of CPUs Grid computing reduces significantly production times and allows to Grid computing reduces significantly production times and allows to
tackle larger problems (see next slide) tackle larger problems (see next slide)
The cons: The cons:
DData ata movements movements limit the scalabilitylimit the scalability for a large number of CPU’s for a large number of CPU’s Note: this is a GRID.it limitation: there is no shared Filesystem Note: this is a GRID.it limitation: there is no shared Filesystem
between the WNs, so each file needs to be copied everywhere! between the WNs, so each file needs to be copied everywhere!
To hide the latency (ideas):To hide the latency (ideas): Smart data distribution from MWN to WN’s:Smart data distribution from MWN to WN’s:
Reduce the amount of data to be movedReduce the amount of data to be moved Proportionate BioDCV subtasks to local cacheProportionate BioDCV subtasks to local cache
Data transferred via MPI communication Data transferred via MPI communication Requires MPI coding and some MPI adaptation of the code)Requires MPI coding and some MPI adaptation of the code)
ResultsResults
MOVE no. 2: Improving the system MOVE no. 2: Improving the system
Reduce the amount of data to be movedReduce the amount of data to be moved
1.1. Redesign “per run”:Redesign “per run”: SVM models (about 200) andSVM models (about 200) and results, evaluationresults, evaluation Variables for semisupervised analysisVariables for semisupervised analysis
all managed within one data structureall managed within one data structure
2.2. A large part of the sampletracking semisupervised analysis,A large part of the sampletracking semisupervised analysis,is now managed within BioDCV (about 2000 files, 300MBis now managed within BioDCV (about 2000 files, 300MB) ) i.e. stored through SQLite. i.e. stored through SQLite.
3.3. Randomization of labels is fully automated Randomization of labels is fully automated
4.4. The SVM library is now an external library:The SVM library is now an external library: Modular use of machine learning methodsModular use of machine learning methods Now adding a PDA moduleNow adding a PDA module
5.5. BioDCV now under GPL (code curation …) BioDCV now under GPL (code curation …)
6.6. Distributed at BioDCV with a SubVersion server Distributed at BioDCV with a SubVersion server since September 2005since September 2005
1.1. At work on several clusters:At work on several clusters: MPBA-old: 50 P3 CPUs, 1GHzMPBA-old: 50 P3 CPUs, 1GHz MPBA-new: 6 Xeon CPUs, 2,8 GHzMPBA-new: 6 Xeon CPUs, 2,8 GHz ECT* (BEN): up to 32 (of 100) CPU Xeon, 2.8GHzECT* (BEN): up to 32 (of 100) CPU Xeon, 2.8GHz SISSA (Cozzini): up to 32 (of 60) P4, 2GHz, MyrinetSISSA (Cozzini): up to 32 (of 60) P4, 2GHz, Myrinet
2.2. GRID experiences GRID experiences A.A. Egrid “production grid” (INFN Padua): Egrid “production grid” (INFN Padua):
up to 64 (of 100) Cpu Xeon, 2-3GHzup to 64 (of 100) Cpu Xeon, 2-3GHzMicroarray data: Sarcoma, HS random, Morishita, Wang, …Microarray data: Sarcoma, HS random, Morishita, Wang, …
B.B. LESSONS LEARNED: LESSONS LEARNED:
i.i. the latest version reduces latencies (system times) due to file copying the latest version reduces latencies (system times) due to file copying and management and management CPU saturation CPU saturation
ii.ii. Life quality (and more studies): huge reduction of file installing and Life quality (and more studies): huge reduction of file installing and retrieving from facilities and WITHIN facilities retrieving from facilities and WITHIN facilities
iii.iii. Forgetting the severe limitation of file system (AFS, …)Forgetting the severe limitation of file system (AFS, …)iv.iv. Now installing 2.6 LCG2 (CERN realise September 2005) Now installing 2.6 LCG2 (CERN realise September 2005)
(OCTOBER 2005) CLUSTER AND GRID ISSUES(OCTOBER 2005) CLUSTER AND GRID ISSUES
INFRASTRUCTURE•MPACluster -> available for batch jobs•Connecting with IFOM -> 2005•Running at IFOM -> 2005/2006•Production on GRID resources
(spring 2005)
ChallengesChallenges
ALGORITHMS II
1. Gene list fusion: suite of algebraic/statistical methods
2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large scale semi-supervised analysis
3. New SVM Kernels for prediction on spectrometry data within complete validation
ALGORITHMS II
1. Gene list fusion: suite of algebraic/statistical methods
2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large scale semi-supervised analysis
3. New SVM Kernels for prediction on spectrometry data within complete validation
Challenges for predictive profilingChallenges for predictive profiling
•BASIC CLASSIFICATION: MODELS, lists, additional tools
•Tools for researchers: subtype discovery, outlier detection
•Connection to data (DB–MIAME)
•BASIC CLASSIFICATION: MODELS, lists, additional tools
•Tools for researchers: subtype discovery, outlier detection
•Connection to data (DB–MIAME)
Challenges (AIRC-BICG)Challenges (AIRC-BICG)
HPC-Interaction: access through webfront-ends to GRID HPCHPC-Interaction: access through webfront-ends to GRID HPC
www
Apa
che
AcknowledgmentsAcknowledgments
ITC-irst, Trento
Davide AlbaneseGiuseppe JurmanStefano MerlerRoberto FlorAlessandro Soraruf
ICTP E-GRID Project, Trieste
Angelo LetoCristian Zoicas
Riccardo MurriEzio CorsoAlessio Terpin
ITC-irst, Trento
Davide AlbaneseGiuseppe JurmanStefano MerlerRoberto FlorAlessandro Soraruf
ICTP E-GRID Project, Trieste
Angelo LetoCristian Zoicas
Riccardo MurriEzio CorsoAlessio Terpin
IFOM-FIRC and INT, Milano
James ReidManuela GariboldiMarco A. Pierotti
Grants:
BICG (AIRC)Democritos
Data:
ShoweLab (Wistar)Cardiogenomics PGA
IFOM-FIRC and INT, Milano
James ReidManuela GariboldiMarco A. Pierotti
Grants:
BICG (AIRC)Democritos
Data:
ShoweLab (Wistar)Cardiogenomics PGA
DTW based clusteringDTW based clustering8
1a
41
a4
0a
12
7a 75
a5
2a
30
06
4a
13
0a6
8a
22
70
a2
9a
10
5a5
45
1a
67
a 89
a7
9a
88 12
49
6a
87
a6
3a
97
a6
28
6a
49
a3
01a 2
06
61
18 3
02
24
12
8a8
0a
12
31
16
47
a4
5a
93
a9
8a
19
13
21
06a
50
a8
2a 21
a9
2a
83
a1
12
42
a1
19a
72
a5
7a
48
a2
53
6a
58
a6
9a
10
31
21
12
2a3
09
51
44a
56
a9
0a 5
5a
59
a9
9a
43
a3
7a
44
a8
51
00a
11
11
43a
61
a1
72
7a
12
58
41
09
12
01
04a
13
79
1a
14
11
7 13
33
0.00
00.
002
0.00
40.
006
0.00
8
Positive
Hei
ght
Curves were clustered with Curves were clustered with Dynamic Time Warping, as Dynamic Time Warping, as distance with weight distance with weight configuration (1,2,1).configuration (1,2,1).
Subgroup and pathological featuresSubgroup and pathological features11
6
47a
45a
93a
98a
19
132
106a
0.00
060.
0008
0.00
100.
0012
0.00
14
Positive
Hei
ght
DTW based clustering of sample DTW based clustering of sample tracking profiles of cluster tracking profiles of cluster subgroup.subgroup.
All samples have virus type B All samples have virus type B (V-B). (V-B).
Bottom lines: incidence in the Bottom lines: incidence in the cluster subgroup and in all the cluster subgroup and in all the positive samples. positive samples.
SampleSample Virus Virus type Btype B
Virus Virus Type CType C
132 0 1
19 0 1
45a 0 1
47a 0 1
106a 0 1
93a 0 1
98a 0 1
116 0 1
% 0 100
All(%) 18 65
Predictive errorPredictive error
Number of features
AT
E
0
5
10
15
20
25
30
1 5 50 150 500 1993
Complete datasetShaved dataset
Confidence interval (.95 level)