INTRODUCTION
We connect, in a complete pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation platform for predictive analysis. We build on two existing software platforms (MS-Analyzer and BioDCV) and on emerging proteomics standards. In this set-up, BioDCV is accessed from the MS-Analyzer workflow as a service, providing an end-to-end pipeline for proteomics data analysis. Predictive classification studies on MALDI-TOF data based on this pipeline are presented.
REFERENCES
[1] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Using ontologies for preprocessing and mining spectra data on the Grid. FGCS, 2006, in press, http://dx.doi.org/10.1016/j.future.2006.04.011
[2] M. Cannataro, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE CBMS 2005: 549-554.
[3] C. Furlanello, M. Serafini, S. Merler, G. Jurman. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2):110-118, 2005.
[4] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, C. Furlanello. Proteome profiling without selection bias. IEEE CBMS 2006: 941-946.
[5] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, Q. Le. Sample classification from protein mass spectrometry, by "peak probability contrasts". Bioinformatics, 20(17):3034-3044, 2004.
Workflows, ontologies and standards for unbiased prediction in high-throughput proteomics
Cannataro M*, Barla A**, Gallo A*, Paoli S**, Jurman G**, Merler S**, Veltri P*, Furlanello C**. *University Magna Graecia of Catanzaro, Italy, **ITC-irst, Trento, Italy
MGED 9, September 7-10, 2006, Seattle, WA, U.S.A.
DATASETS
D1. MALDI-TOF Ovarian Cancer Dataset, from www-stat.stanford.edu/~tibs/PPC/Rdist [5]
• 49 samples (24 diseased + 25 controls)
• Each raw sample has 56384 m/z measurements (892 KB)
• Each preprocessed sample has 564 m/z measurements (19 KB)
• Preprocessing:
  • Normalization
  • Binning
• Biomarker identification:
  • Baseline subtraction
  • Peak Alignment – Clustering
  • 67 features identified
D2. Lab calibration sample: MALDI-TOF, human serum
• 20 technical replicates (10 control samples, 10 with 2 proteins)
• 34671 measurements per raw sample, 347 m/z measurements after preprocessing
• Predictive discrimination with 7 peaks
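The normalization and binning steps listed for D1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the poster does not give the normalization method or the bin width, so total-ion-current scaling and a bin width of 100 points are assumptions here (the width was picked only because it reproduces the 56384 to 564 reduction), and the function names and the synthetic spectrum are hypothetical.

```python
def normalize_tic(intensities):
    """Scale a spectrum so that its intensities sum to 1 (assumed TIC normalization)."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)
    return [v / total for v in intensities]


def bin_spectrum(intensities, bin_width):
    """Replace each run of `bin_width` consecutive points by their mean."""
    binned = []
    for start in range(0, len(intensities), bin_width):
        chunk = intensities[start:start + bin_width]
        binned.append(sum(chunk) / len(chunk))
    return binned


# Synthetic stand-in for one raw D1 spectrum (56384 measurements).
raw = [float(i % 7) for i in range(56384)]

# Assumed bin width of 100 points; the last bin is shorter than the others.
prepared = bin_spectrum(normalize_tic(raw), bin_width=100)
print(len(prepared))  # 564
```

Binning on point index rather than on m/z value keeps the sketch short; a real pipeline would more likely bin on the m/z axis so that bins from different spectra line up.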
MS-ANALYZER
MS-Analyzer [1] is a platform for the integrated management and processing of proteomics spectra data. It supports the ontology-based design of “in silico” proteomics studies: ontologies are used to model software tools and spectra data, while workflows are used to model applications. MS-Analyzer uses a specialized spectra database and provides a set of pre-processing services:
• Interface to heterogeneous mass spectrometer formats such as MALDI-TOF, SELDI-TOF, and ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative.
• Acquisition, storage, and management of MS data with the SpecDB database. Spectra are stored in their different stages (raw, pre-processed, prepared). Single spectra, multiple spectra, or portions of spectra can be queried (in-database preprocessing).
• Preprocessing of MS data (smoothing, baseline subtraction, normalization, binning, peak alignment), as well as spectra preparation for further data mining (spectra to ARFF conversion) [2].
• Sharing of experimental data, workflows, and knowledge.
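The spectra-to-ARFF preparation step mentioned above can be illustrated with a minimal sketch. It is not MS-Analyzer's converter: the function name, the `mz_` attribute naming scheme, and the sample values are all hypothetical, and the only thing taken as given is the general ARFF layout (`@relation`, `@attribute`, `@data`).

```python
def spectra_to_arff(relation, mz_values, samples):
    """Render prepared spectra as ARFF text.

    samples: list of (intensities, class_label) pairs, one per spectrum.
    """
    labels = sorted({label for _, label in samples})
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per m/z position (hypothetical naming scheme).
    for mz in mz_values:
        lines.append(f"@attribute mz_{mz:g} numeric")
    lines.append("@attribute class {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for intensities, label in samples:
        if len(intensities) != len(mz_values):
            raise ValueError("spectrum length does not match the m/z axis")
        lines.append(",".join(f"{v:g}" for v in intensities) + "," + label)
    return "\n".join(lines) + "\n"


# Two synthetic prepared spectra on a two-point m/z axis.
arff = spectra_to_arff(
    "spectra",
    [1000.0, 1000.5],
    [([0.2, 1.5], "control"), ([0.3, 4.0], "diseased")],
)
print(arff)
```

Each spectrum becomes one data row and each m/z position one attribute, so all spectra must share the same m/z axis before conversion, which is what binning and peak alignment provide.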
[Figure: MS-Analyzer architecture. Ontology-based Workflow Designer: Ontology Assistant (browsing, querying), WF Editor (composition, browsing, selection, visualization), WF Schema (abstract and concrete WF), Ontology manager, Ontologies. Workflow Scheduler: Resource Discovery Services, WF Translator, WF Scheduler, WF Monitor, UDDI/MDS, WSDL metadata. Web services (WS1, WS2) reached over the network: Spectra Management Services, Spectra Visualization Services, Spectra Preparation Services, Spectra Preprocessing Services. SpecDB APIs: RSR (raw spectra), PPSR (pre-processed spectra), PSR (prepared spectra).]
[Figure: Web services architecture. M-WS in the Ontology-based Workflow Designer; FTP repository (data, metadata); repository URL and email sent to the BioDCV WS on the DMZ server (Apache, mod_python, ZSI module); BioDCV WS front-end server.]
BIODCV
The predictive modeling portion of the proposed system is provided by BioDCV, the ITC-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To cope with the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature ranking procedure [3].
For proteomics, it includes methods for baseline subtraction, spectra alignment, peak clustering and peak assignment that were adapted from existing R packages and concatenated to the complete validation system.
BioDCV is also a grid application and it has been used in production within the EGEE Biomed VO [4].
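The idea behind complete validation, repeating feature selection inside every resampling split so that test samples never influence the choice of features, can be sketched as follows. This is a simplified illustration and not BioDCV's code: a mean-difference ranking and a nearest-centroid classifier are swapped in for E-RFE and the SVM, and all function names, parameters, and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def rank_features(X, y):
    """Rank features by absolute difference of class means (stand-in for E-RFE)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def centroid_predict(X_train, y_train, x, features):
    """Nearest-centroid prediction on the selected features (stand-in for the SVM)."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        dist = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def complete_validation(X, y, n_selected, n_splits=20, test_fraction=0.25, seed=0):
    """Average test error with feature selection repeated inside every split."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        n_test = max(1, int(len(indices) * test_fraction))
        test, train = indices[:n_test], indices[n_test:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Selection sees only the training part, so the test samples stay unseen.
        features = rank_features(X_tr, y_tr)[:n_selected]
        wrong = sum(centroid_predict(X_tr, y_tr, X[i], features) != y[i] for i in test)
        errors.append(wrong / len(test))
    return mean(errors)


# Pure-noise data: 40 samples, 30 features, balanced labels.
rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(0, 1) for _ in range(30)] for _ in y]
print(complete_validation(X, y, n_selected=5))
```

On pure-noise data an honest estimate should stay near chance level. Selecting the features once on all samples and then resampling would tend to report an optimistically low error, which is the selection bias that complete validation is meant to control.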
[Figure: BioDCV processing chain. Feature extraction (within sample, across samples); complete validation; R scripts (visualization, ATE, sample tracking); PHP (biomarker lists, HTML publication); outputs: biomarker data, report.]
ACKNOWLEDGMENTS
• ITC-irst: R. Flor, D. Albanese, B. Irler
• UniCZ: G. Cuda, M. Gaspari, P.H. Guzzi, T. Mazza
WEB SERVICES ARCHITECTURE
Three Internet web services remotely integrate the two main system components.
The BioDCV component is invoked from the MS-Analyzer workflow as a web service (biodcv-ws-client) in the UniCZ network: data and metadata are copied to an FTP repository, then the data URL and a notification email address are transmitted to the BioDCV web service (biodcv-ws) in a DMZ area of the ITC-irst network.
This service is run directly by Apache with mod_python and the Zolera SOAP Infrastructure (ZSI). The incoming data are transferred to the internal front-end server (server-cz-tn.py) within the firewalled area.
The front-end first launches the feature extraction module and then a full complete validation process using the BioDCV component. The system outputs are then formatted as graphs and tables by R and PHP scripts. The results are published by the front-end on the DMZ server, and the user is notified by email.
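The request that the client sends to the BioDCV web service carries two items, the repository URL and the notification email address. A minimal sketch of such a message as a SOAP 1.1 envelope is given below. The service namespace, the operation name `submitAnalysis`, the element names, and the example URL and address are all hypothetical; the poster does not describe the actual interface of biodcv-ws.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:biodcv-ws"                    # hypothetical service namespace


def build_submit_request(data_url, notify_email):
    """Build a SOAP envelope carrying the repository URL and notification address."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # Hypothetical operation and parameter names.
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}submitAnalysis")
    ET.SubElement(request, f"{{{SERVICE_NS}}}dataUrl").text = data_url
    ET.SubElement(request, f"{{{SERVICE_NS}}}email").text = notify_email
    return ET.tostring(envelope, encoding="unicode")


message = build_submit_request("ftp://repository.example.org/run01/", "user@example.org")
print(message)
```

Sending only a URL and an address keeps the SOAP message small: the spectra themselves travel through the FTP repository, and the results come back asynchronously by email.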
[Figure: ATE versus number of features (x-axis ticks: 1, 5, 10, 15, 20, 30, 40, 50, 67; y-axis ticks: 10 to 40).]
[Figure: sample tracking, E(S) (0.0 to 1.0) versus n (1, 5, 50) in 25 panels: S0 (26), S1 (28), S2 (27), S3 (25), S4 (26), S5 (35), S6 (19), S7 (32), S8 (31), S9 (30), S10 (24), S11 (22), S12 (22), S13 (24), S14 (20), S15 (27), S16 (24), S17 (22), S18 (26), S19 (18), S20 (27), S21 (25), S22 (19), S23 (21), S24 (23). Legend: error rate (tumour tissue), error rate (non-tumoural tissue), no-information error rate.]
The BioDCV system: EGEE BioMed VO
[Figure: data transfers between the local site and the grid via scp and grid-ftp; transfer sizes 2-50 MB and 50-400 MB.]
Commands:
1. grid-url-copy/lcg-cp db from local to SE
2. edg-job-submit BioDCV.jdl
3. grid-url-copy/lcg-cp db from SE to local
[Figure: D2 mean spectra for groups A and B, each with its .95 Student bootstrap CI; x-axis: m/z (9100 to 9200); y-axis: intensity (0 to 4000); marked peak at 9133.17 Da.]