Proteome data integration characteristics and challenges


K. Belhajjame¹, R. Cote⁴, S.M. Embury¹, H. Fan², C. Goble¹, H. Hermjakob, S.J. Hubbard¹, D. Jones³, P. Jones⁴, N. Martin², S. Oliver¹, C. Orengo³, N.W. Paton¹, M. Pentony³, A. Poulovassilis², J. Siepen, R.D. Stevens¹, C. Taylor⁴, L. Zamboulis², and W. Zhu⁴

¹ University of Manchester

² Birkbeck College

³ University College London

⁴ European Bioinformatics Institute

All Hands Meetings, 2005

Outline

Experimental proteomics

ISPIDER architecture

Example use cases

Conclusion


Experimental proteomics

The study of the set of proteins produced by an organism, with the aim of understanding their behaviour under varying conditions

An essential component for elucidation of the biological functions of proteins

[Figure: protein identification pipeline. Stages: Separation (2D gel electrophoresis), Protein digestion (Enzymatic digestion), Mass Spectrometry (MALDI-TOF), Identification (against a Protein DB), yielding a Protein ID]
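The enzymatic digestion step can be mimicked in silico, which is how identification tools typically derive candidate peptides from a protein database. The following is a minimal sketch, assuming trypsin as the enzyme (the slides do not name one) and the usual simplified rule: cleave after K or R unless the next residue is P. The function name and the toy sequence are illustrative only.

```python
from typing import List


def tryptic_digest(sequence: str) -> List[str]:
    """Split a protein sequence after K or R, except when followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # remaining C-terminal peptide
    return peptides


# Toy sequence, invented for illustration; not a real protein.
peptides = tryptic_digest("MAKRPLKGG")
print(peptides)
```

Real digestion also produces missed cleavages and modified residues, which this sketch ignores.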


Experimental proteomics

Development of new technologies for:

– protein separation (2D-SDS-PAGE, HPLC, Capillary Electrophoresis)

– mass spectrometry (Multi-Dimensional protein identification)

Availability of publicly accessible protein sequence databases

Proteomics databases (PedroDB, gpmDB, PepSeeker, PRIDE, …)

Building experiments that involve orchestrating analysis services and processing and integrating data


Objectives of ISPIDER

A Grid dedicated to the creation of bioinformatics experiments for proteomics

Develop new proteome databases and services, or make existing ones Grid-enabled

Develop middleware support for developing and executing new proteome analyses, based on distributed query processing and workflow technologies

Undertake proteomic studies that demonstrate the effectiveness of the resulting infrastructure




ISPIDER

[Figure: ISPIDER architecture, in four layers]

ISPIDER Proteomics Clients: Vanilla Query Client; 2D Gel Visualisation Client; + Aspergil. Extensions; + Phosph. Extensions; PPI Validation + Analysis Client; Protein ID Client

ISPIDER Proteomics Grid Infrastructure: Proteome Request Handler; Instance Ident/Mapping Services; Proteomic Ontologies/Vocabularies; Source Selection Services; Data Cleaning Services

Existing E-Science Infrastructure: myGrid Ontology Services; myGrid DQP; AutoMed; myGrid Workflows

Public Proteomics Resources (web services over Existing Resources and ISPIDER Resources): PS, PF, TR, GS, FA, PPI, PID, PRIDE, PEDRo and Phos, each with a WS

KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package



Value-added protein datasets

Motivation

The results of protein identification experiments are usually used as input to further analysis processes:

– Gathering evidence for a biological hypothesis

– Suggesting new hypotheses

Objective

Augment the identification results with additional information on the identified proteins

Implementation

Taverna workflow system

[Figure: workflow for value-added protein datasets. Components: PepMapper Web Service, GO Services, Auxiliary Services]
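The augmentation step in this use case amounts to joining each identified protein with annotations from a second service. A minimal sketch of that join follows. It assumes the identification results arrive as dictionaries with an `accession` key, and it replaces the GO service call with a local lookup table. The accessions, scores and term labels are placeholders, and none of the names reflect the actual PepMapper or GO service interfaces.

```python
from typing import Callable, Dict, List


def augment_identifications(
    hits: List[Dict],
    annotate: Callable[[str], List[str]],
) -> List[Dict]:
    """Return copies of the hits with a 'go_terms' field added to each."""
    return [{**hit, "go_terms": annotate(hit["accession"])} for hit in hits]


# Placeholder annotations standing in for a call to a GO service.
GO_LOOKUP = {"ACC0001": ["GO:EXAMPLE1", "GO:EXAMPLE2"]}


def lookup_go_terms(accession: str) -> List[str]:
    """Hypothetical stand-in for the remote annotation service."""
    return GO_LOOKUP.get(accession, [])


# Placeholder identification results, not real experimental data.
hits = [
    {"accession": "ACC0001", "score": 87.5},
    {"accession": "ACC0002", "score": 41.0},
]
augmented = augment_identifications(hits, lookup_go_terms)
for row in augmented:
    print(row)
```

Passing the annotation source in as a function keeps the join independent of where the annotations come from, which is roughly the role a workflow system plays when it wires services together.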


Genome-focused protein identification

Motivation

Currently, protein identification searches are performed over large data sets. This means fewer false negatives, but false positives are also more likely.

Objective

More focused and thus more efficient protein identification

Implementation

Taverna workflow system

DQP, a service-based query processor




[Figure: workflow for genome-focused protein identification. Components: DQP Web Service, IPI, PepMapper web service, GOA Web Service]

select p.Name, p.Seq from p in db_proteinSequences where p.OS='HomoSapiens';
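The query above restricts the search space to the proteins of one organism before identification is run. As a rough illustration of its semantics only (not of how DQP evaluates it), the same selection can be written over in-memory records. The record type, the entry names and the sequences below are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProteinEntry:
    name: str
    seq: str
    os: str  # organism, mirroring the OS attribute used in the query


def select_by_organism(
    entries: List[ProteinEntry], organism: str
) -> List[Tuple[str, str]]:
    """Equivalent of: select p.Name, p.Seq from p in entries where p.OS=organism."""
    return [(e.name, e.seq) for e in entries if e.os == organism]


# Toy entries, invented for illustration only.
db_protein_sequences = [
    ProteinEntry("protein_a", "MAKRPLKGG", "HomoSapiens"),
    ProteinEntry("protein_b", "MSTNPKPQR", "MusMusculus"),
    ProteinEntry("protein_c", "MKWVTFISL", "HomoSapiens"),
]
human = select_by_organism(db_protein_sequences, "HomoSapiens")
print(human)
```

Only the filtered subset is then passed to the identification step, which is what makes the search more focused.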


Integrated access to proteome databases

Motivation

The ability to analyse existing proteomics results en masse is limited by heterogeneities between the schemas of the different databases

Objective

Provide integrated access to proteome databases through a common schema

Implementation

AutoMed, a framework for mapping heterogeneous schemata

DQP, a service-based query processor




[Figure: architecture for integrated access. Components: AutoMed Wrappers; PRIDE, PedroDB and gpmDB; AutoMed Repository; three OGSA-DAI Activities; OGSA Distributed Query Processor; AutoMed Query Processor; AutoMed DQP Wrapper. Arrow labels: User query, Result, OQL query, OQL result]
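The idea of querying several databases through a common schema can be sketched with per-source field mappings. This is a deliberately simplified illustration: AutoMed expresses mappings as schema transformations rather than flat renamings, and the source names and field names below are invented. They are not the real PRIDE, PedroDB or gpmDB schemas.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical mappings from each source's field names to the common schema.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "source_a": {"acc": "accession", "pep_seq": "peptide", "score": "score"},
    "source_b": {"protein_id": "accession", "sequence": "peptide", "hit_score": "score"},
}


def to_common(source: str, record: Dict) -> Dict:
    """Rename a source record's fields to the common schema."""
    return {common: record[local] for local, common in FIELD_MAPS[source].items()}


def integrated_query(
    sources: Dict[str, List[Dict]],
    predicate: Callable[[Dict], bool],
) -> Iterator[Dict]:
    """Evaluate one predicate, written against the common schema, over all sources."""
    for source, records in sources.items():
        for record in records:
            common = to_common(source, record)
            if predicate(common):
                yield {**common, "source": source}


# Placeholder records, not real data.
sources = {
    "source_a": [{"acc": "ACC0001", "pep_seq": "MAK", "score": 0.9}],
    "source_b": [
        {"protein_id": "ACC0001", "sequence": "RPLK", "hit_score": 0.4},
        {"protein_id": "ACC0002", "sequence": "GG", "hit_score": 0.7},
    ],
}
results = list(integrated_query(sources, lambda r: r["accession"] == "ACC0001"))
for row in results:
    print(row)
```

The caller writes the predicate once against the common field names; a real mapping would also have to reconcile value semantics (for example, differing score scales), which simple renaming does not address.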


Conclusions

+ Available e-science technologies provide rapid prototyping facilities for bioinformatics analyses

+ Combining such technologies is possible and opens up more possibilities: Taverna + DQP, AutoMed + DQP

- Writing custom code is usually required:

– Processing service output to extract inputs for following services

– Transforming results between data formats

– Dealing with mismatches between identifiers

Future directions

Developing a user-guided environment for the detection and resolution of mismatches

Development of Proteomics client applications (PepMapper, PepSeeker and PRIDE)
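The identifier mismatches mentioned in the conclusions are typical of the glue code needed between services. The sketch below shows one common shape of that code: normalise identifiers on both sides, then report which ones could be matched. The normalisation rules (drop a `DB:` prefix, drop a `.version` suffix, ignore case) and the identifiers themselves are assumptions chosen for illustration, not the conventions of any particular ISPIDER resource.

```python
from typing import Dict, Iterable, List, Tuple


def normalise_id(identifier: str) -> str:
    """Strip whitespace, a 'DB:' style prefix and a '.version' suffix; upper-case."""
    core = identifier.strip().split(":")[-1]
    return core.split(".")[0].upper()


def reconcile(
    produced: Iterable[str], accepted: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Map identifiers emitted by one service onto those known to the next.

    Returns (matched, unmatched): matched maps each produced identifier to the
    accepted form; unmatched lists identifiers left for manual resolution.
    """
    accepted_by_key = {normalise_id(a): a for a in accepted}
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for identifier in produced:
        key = normalise_id(identifier)
        if key in accepted_by_key:
            matched[identifier] = accepted_by_key[key]
        else:
            unmatched.append(identifier)
    return matched, unmatched


# Placeholder identifiers, invented for illustration.
matched, unmatched = reconcile(
    ["DB:ACC0001.2", "acc0002", "ACC0009"],
    ["ACC0001", "ACC0002.1"],
)
print(matched)
print(unmatched)
```

The `unmatched` list is where a user-guided tool would step in, since purely syntactic normalisation cannot resolve identifiers that differ between naming schemes.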
