INTRODUCTION

We connect, in a complete pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation platform for predictive analysis. We build on two existing software platforms (MS-Analyzer and BioDCV) and on emerging proteomics standards. In this set-up, BioDCV is accessed from the MS-Analyzer workflow as a service, thus providing a complete pipeline for proteomics data analysis. We present predictive classification studies on MALDI-TOF data based on this pipeline.

REFERENCES

[1] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Using ontologies for preprocessing and mining spectra data on the Grid. FGCS, 2006, in press. http://dx.doi.org/10.1016/j.future.2006.04.011

[2] M. Cannataro, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE CBMS 2005: 549-554.

[3] C. Furlanello, M. Serafini, S. Merler, G. Jurman. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2):110-118, 2005.

[4] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, C. Furlanello. Proteome profiling without selection bias. IEEE CBMS 2006: 941-946.

[5] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, Q. Le. Sample classification from protein mass spectrometry, by "peak probability contrasts". Bioinformatics, 20(17):3034-3044, 2004.

Workflows, ontologies and standards for unbiased prediction in high-throughput proteomics

Cannataro M*, Barla A**, Gallo A*, Paoli S**, Jurman G**, Merler S**, Veltri P*, Furlanello C**. *University Magna Graecia of Catanzaro, Italy, **ITC-irst, Trento, Italy

MGED 9, September 7-10, 2006, Seattle, WA, U.S.A.

DATASETS

D1. MALDI-TOF Ovarian Cancer Dataset, from www-stat.stanford.edu/~tibs/PPC/Rdist [5]

• 49 samples (24 diseased + 25 controls)
• Each raw sample has 56384 m/z measurements (892 KB)
• Each preprocessed sample has 564 m/z measurements (19 KB)
• Preprocessing:
  • Normalization
  • Binning
• Biomarker identification:
  • Baseline subtraction
  • Peak alignment, clustering
  • 67 features identified

D2. MALDI-TOF lab calibration sample, human serum

• 20 technical replicates (10 control samples, 10 with 2 proteins)
• 34671 measurements per raw sample, 347 m/z measurements after preprocessing
• Predictive discrimination with 7 peaks

MS-ANALYZER

MS-Analyzer [1] is a platform for the integrated management and processing of proteomics spectra data. It supports the ontology-based design of "in silico" proteomics studies: ontologies are used to model software tools and spectra data, while workflows are used to model applications. MS-Analyzer uses a specialized spectra database and provides a set of preprocessing services:

• Interface to heterogeneous mass spectrometer formats such as MALDI-TOF, SELDI-TOF, and ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative.

• Acquisition, storage, and management of MS data with the SpecDB database. Spectra are stored in their different stages (raw, pre-processed, prepared). Single spectra, multiple spectra, or portions of spectra can be queried (in-database preprocessing).

• Preprocessing of MS data (smoothing, baseline subtraction, normalization, binning, peak alignment), as well as spectra preparation for further data mining (spectra to ARFF conversion) [2].

• Sharing of experiment data, workflows, and knowledge.
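Two of the preprocessing steps listed above, normalization and binning, can be sketched in a few lines. This is only an illustration, not the MS-Analyzer implementation: the function names are hypothetical, the normalization shown is a simple scaling to unit total intensity, and the bin width of 100 points is an assumption. That width happens to be consistent with the dataset sizes reported in DATASETS (56384 raw points reduce to 564 bins, and 34671 to 347), but the poster does not state the value actually used.

```python
from typing import List


def normalize_total_intensity(intensities: List[float]) -> List[float]:
    """Scale a spectrum so that its intensities sum to 1 (assumed scheme)."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)  # nothing to scale in an all-zero spectrum
    return [value / total for value in intensities]


def bin_spectrum(intensities: List[float], bin_size: int) -> List[float]:
    """Average consecutive intensity values in fixed-width bins.

    The last bin may hold fewer than bin_size points.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    binned = []
    for start in range(0, len(intensities), bin_size):
        chunk = intensities[start:start + bin_size]
        binned.append(sum(chunk) / len(chunk))
    return binned


def preprocess(intensities: List[float], bin_size: int = 100) -> List[float]:
    """Normalize, then bin (bin_size=100 is an assumption, see lead-in)."""
    return bin_spectrum(normalize_total_intensity(intensities), bin_size)
```

Calling `preprocess` on a raw spectrum of 56384 intensity values returns 564 binned values under these assumptions.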

[Figure: MS-Analyzer architecture. Components shown: Ontology-based Workflow Designer (Ontology Assistant: browsing, querying; WF Editor: composition, browsing, selection, visualization; WF Schema: Abstract, Concrete WF; WF Translator); Workflow Scheduler (WF Scheduler, WF Monitor); Resource Discovery Services (UDDI/MDS, Metadata, WSDL); Ontology manager and Ontologies; Web Services (WS1, WS2) for Spectra Management, Spectra Visualization, Spectra Preparation, and Spectra Preprocessing, connected over the Network; SpecDB APIs over raw, pre-processed, and prepared spectra.]

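[Figure: Web services architecture. Components shown: Ontology-based Workflow Designer with the MS-Analyzer web service client (M-WS); FTP repository (data, metadata); repository URL and email passed to the BioDCV WS on the DMZ server (Apache, mod_python, ZSI module); BioDCV WS front-end server.]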

BIODCV

The predictive modeling portion of the proposed system is provided by BioDCV, the ITC-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To cope with the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature ranking procedure [3].

For proteomics, it includes methods for baseline subtraction, spectra alignment, peak clustering, and peak assignment that were adapted from existing R packages and chained to the complete validation system.

BioDCV is also a grid application and it has been used in production within the EGEE Biomed VO [4].
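The idea behind complete validation can be illustrated with a minimal sketch: feature ranking is repeated inside every resampling split, using training samples only, so that test samples never influence which features are selected. The code below is not BioDCV. As a deliberately simple stand-in for E-RFE it ranks features by the absolute difference of class means, and it classifies with a nearest-centroid rule instead of an SVM. All function names and defaults are hypothetical, and the returned value is the average test error over the splits, which is what we take the ATE label to stand for. The sketch assumes a two-class problem in which every training split contains both classes.

```python
import random
from statistics import mean


def class_means(X, y, features):
    """Mean of each selected feature, per class label."""
    features = list(features)
    return {
        label: [mean(x[f] for x, t in zip(X, y) if t == label) for f in features]
        for label in set(y)
    }


def rank_features(X, y):
    """Order features by |difference of class means| (stand-in for E-RFE)."""
    n_features = len(X[0])
    means = class_means(X, y, range(n_features))
    first, second = sorted(means)  # raises ValueError unless exactly two classes
    score = [abs(means[first][f] - means[second][f]) for f in range(n_features)]
    return sorted(range(n_features), key=lambda f: score[f], reverse=True)


def predict(centroids, x, features):
    """Assign x to the class whose centroid is nearest on the selected features."""
    def squared_distance(label):
        return sum((x[f] - c) ** 2 for f, c in zip(features, centroids[label]))
    return min(sorted(centroids), key=squared_distance)


def complete_validation(X, y, n_keep, n_splits=20, test_fraction=0.25, seed=0):
    """Average test error over random splits, re-selecting features in each split."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    n_test = max(1, int(len(indices) * test_fraction))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        # Ranking sees training samples only: this is what avoids selection bias.
        features = rank_features(X_train, y_train)[:n_keep]
        centroids = class_means(X_train, y_train, features)
        wrong = sum(predict(centroids, X[i], features) != y[i] for i in test)
        errors.append(wrong / n_test)
    return mean(errors)
```

Ranking the features once on the full dataset and then estimating the error by resampling would let the test samples leak into the selection step; keeping the ranking inside the loop is the design choice that matters here, whatever ranking method and classifier are plugged in.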

[Figure: Front-end processing pipeline. Feature extraction (within sample, across samples); complete validation; R scripts (visualization of ATE, sample tracking); PHP (biomarker lists, HTML publication). Outputs: biomarker data and report.]

ACKNOWLEDGMENTS

• ITC-irst: R. Flor, D. Albanese, B. Irler

• UniCZ: G. Cuda, M. Gaspari, P.H. Guzzi, T. Mazza

WEB SERVICES ARCHITECTURE

Three Internet Web Services are used to remotely integrate the two main system components.

The BioDCV component is invoked from the MS-Analyzer workflow as a Web Service (biodcv-ws-client) in the UniCZ network: data and metadata are copied to an FTP repository, then the data URL and a notification email address are transmitted to the BioDCV Web Service (biodcv-ws) in a DMZ area of the ITC-irst network.

This service is run directly by Apache with mod_python and the Zolera SOAP Infrastructure (ZSI). The incoming data are transferred to the internal front-end server (server-cz-tn.py) within the firewalled area.

The front-end first launches the feature extraction module and then a full complete validation process using the BioDCV component. The system outputs are then formatted as graphs and tables by R and PHP scripts. The results are published by the front-end on the DMZ server, and a notification is sent back by email.
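The invocation flow described in this section (data URL and notification email in, feature extraction, complete validation, publication, email notification out) can be summarized as a small orchestration sketch. Everything here is hypothetical: `AnalysisRequest`, `validate_request`, and `run_front_end` are illustrative names, not the interface of biodcv-ws or server-cz-tn.py, and the network steps are passed in as plain callables so that the sketch runs without any FTP, SOAP, or mail infrastructure.

```python
from dataclasses import dataclass
from typing import Any, Callable
from urllib.parse import urlparse


@dataclass(frozen=True)
class AnalysisRequest:
    """What the client transmits: where the data are and whom to notify."""
    data_url: str
    notify_email: str


def validate_request(request: AnalysisRequest) -> None:
    """Reject requests that do not point at an FTP location or lack an email."""
    parsed = urlparse(request.data_url)
    if parsed.scheme != "ftp" or not parsed.netloc:
        raise ValueError("data_url must be an ftp:// URL")
    if "@" not in request.notify_email:
        raise ValueError("notify_email does not look like an email address")


def run_front_end(
    request: AnalysisRequest,
    fetch: Callable[[str], Any],
    extract_features: Callable[[Any], Any],
    complete_validation: Callable[[Any], Any],
    publish: Callable[[Any], str],
    notify: Callable[[str, str], None],
) -> str:
    """Run the steps in the order given in the text and return the report URL."""
    validate_request(request)
    data = fetch(request.data_url)           # transfer data from the repository
    features = extract_features(data)        # feature extraction module
    results = complete_validation(features)  # predictive analysis
    report_url = publish(results)            # graphs and tables on the DMZ server
    notify(request.notify_email, report_url)  # email notification to the requester
    return report_url
```

Passing the individual steps as callables keeps the ordering logic separate from the transport details, so each step can be replaced by a stub when trying the flow out locally.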

[Figure: ATE versus number of features (1, 5, 10, 15, 20, 30, 40, 50, 67); ATE axis ticks from 10 to 40.]

[Figure: Sample tracking. 25 panels, titled "1: S0 (26)" through "25: S24 (23)", each showing E(S) on a 0.0 to 1.0 scale. Legend: error rate (tumour tissue), error rate (non-tumoural tissue), no-information error rate.]

[Figure: The BioDCV system on the EGEE BioMed VO. Data transfers of 2-50 MB and 50-400 MB via scp and grid-ftp.]

Commands:

1. grid-url-copy/lcg-cp db from local to SE
2. edg-job-submit BioDCV.jdl
3. grid-url-copy/lcg-cp db from SE to local

[Figure: D2 spectra, mean A and mean B, each with its .95 Student bootstrap CI. Intensity (0 to 4000) versus m/z (9100 to 9200); peak marked at 9133.17 Da.]
